#### Twitter Sentiment Analysis

The objective of this task is to detect hate speech in tweets.

We are given a training sample of tweets and labels, where label '1' denotes a negative tweet (with hate speech) and 
label '0' denotes a positive tweet (without hate speech).

Our objective is to predict the labels on the test dataset.

#### Data
Our overall collection of tweets was split in the ratio of 65:35 into training and testing data. Out of the testing data, 30% is public and the rest is private.

#### Evaluation Metric:
The classification model is evaluated with the F1 score, the harmonic mean of precision and recall.

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 Score = 2 * (Recall * Precision) / (Recall + Precision)
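
As a quick worked example of these formulas, the sketch below computes all three values from made-up counts (60 true positives, 20 false positives, 40 false negatives). The function name and the counts are illustrative and do not come from the competition data.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for the positive class from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (recall * precision) / (recall + precision)
    return precision, recall, f1

# Made-up counts, for illustration only
precision, recall, f1 = precision_recall_f1(tp=60, fp=20, fn=40)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```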

<b> Loading train and test data </b>

In [1]:
import os
#os.chdir(r"C:\Users\pdrva\Desktop\Divya\Kaggle\Analytics Vidya Hackthon Data\Twitter Sentiment Analysis")
import pandas as pd
train=pd.read_csv("train.csv")
#test=pd.read_csv("test_tweets.csv")
#submission=pd.read_csv("Submission.csv")

<b> Understanding data </b>

In [2]:
print("train data has {} rows and {} columns".format(train.shape[0],train.shape[1]))
#print("test data has {} rows and {} columns".format(test.shape[0],test.shape[1]))

train data has 31962 rows and 3 columns


In [3]:
print("\ntrain datatypes-->\n")
print(train.dtypes)
print("\ntest datatypes-->\n")
#print(test.dtypes)


train datatypes-->

id        int64
label     int64
tweet    object
dtype: object

test datatypes-->



In [4]:
import numpy as np
np.unique(train['label'], return_counts=True)

(array([0, 1], dtype=int64), array([29720,  2242], dtype=int64))

Notice the class imbalance in the data.
Only about 7 percent of the training records (2,242 of 31,962) contain hate speech, while the rest do not.
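
With this much imbalance, accuracy alone is a weak yardstick, which is one reason F1 is a sensible metric here. As a rough illustration using the label counts above, consider a hypothetical classifier that always predicts '0':

```python
# Label counts taken from the np.unique output above
n_nonhate, n_hate = 29720, 2242
total = n_nonhate + n_hate

hate_share = n_hate / total

# Always predicting 0 gets every non-hate tweet right and every hate tweet wrong
baseline_accuracy = n_nonhate / total

# For the hate class: no true positives, no false positives, every hate tweet missed
tp, fp, fn = 0, 0, n_hate
baseline_f1 = 2 * tp / (2 * tp + fp + fn)

print(f"hate share:        {hate_share:.3f}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"baseline F1:       {baseline_f1:.3f}")
```

The baseline reaches roughly 93% accuracy while its F1 for the hate class is zero.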

In [5]:
train.head(20)

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation
5,6,0,[2/2] huge fan fare and big talking before the...
6,7,0,@user camping tomorrow @user @user @user @use...
7,8,0,the next school year is the year for exams.ð...
8,9,0,we won!!! love the land!!! #allin #cavs #champ...
9,10,0,@user @user welcome here ! i'm it's so #gr...


The tweets contain @user mentions and #hashtags. The cleaning step removes the @user handles; the hashtags are kept for now and extracted into their own column later.

#### Data cleaning

In [6]:
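import re
def remove_handles(text,pattern):
    # Delete every match of the pattern in one pass. Substituting the pattern itself
    # avoids re-using each matched handle as an unescaped regex, which could leave
    # fragments behind when one handle is a prefix of another (e.g. "@use" and "@user").
    return re.sub(pattern,'',text)

# Remove twitter handles
train['clean_tweet']=train.tweet.apply(lambda tweet: remove_handles(str(tweet),r"@[\w]*"))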
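
A quick check of handle removal on a made-up tweet. It contrasts a loop that substitutes each matched handle in turn with a single `re.sub` over the pattern; the function names and the sample string are illustrative only. When one handle is a prefix of another (as with a truncated `@use` next to `@user`), the loop can leave stray characters behind.

```python
import re

HANDLE_PATTERN = r"@[\w]*"  # same pattern as in the cleaning cell

def remove_handles_loop(text, pattern):
    # Substitute each matched handle one at a time, using the handle as a regex
    for handle in re.findall(pattern, text):
        text = re.sub(handle, '', text)
    return text

def remove_handles_single_pass(text, pattern):
    # Substitute every match of the pattern in one call
    return re.sub(pattern, '', text)

sample = "@use hello @user"  # made-up tweet with a truncated handle
loop_result = remove_handles_loop(sample, HANDLE_PATTERN)
single_result = remove_handles_single_pass(sample, HANDLE_PATTERN)
print(repr(loop_result))
print(repr(single_result))
```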

In [7]:
train.head(20)

Unnamed: 0,id,label,tweet,clean_tweet
0,1,0,@user when a father is dysfunctional and is s...,when a father is dysfunctional and is so sel...
1,2,0,@user @user thanks for #lyft credit i can't us...,thanks for #lyft credit i can't use cause th...
2,3,0,bihday your majesty,bihday your majesty
3,4,0,#model i love u take with u all the time in ...,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation,factsguide: society now #motivation
5,6,0,[2/2] huge fan fare and big talking before the...,[2/2] huge fan fare and big talking before the...
6,7,0,@user camping tomorrow @user @user @user @use...,camping tomorrow dannyâ¦
7,8,0,the next school year is the year for exams.ð...,the next school year is the year for exams.ð...
8,9,0,we won!!! love the land!!! #allin #cavs #champ...,we won!!! love the land!!! #allin #cavs #champ...
9,10,0,@user @user welcome here ! i'm it's so #gr...,welcome here ! i'm it's so #gr8 !


In [8]:
def remove_specialchars(text,pattern):
    text=re.sub(pattern,' ',text)
    return text

# Remove special characters,numbers and punctuations except " # " and " ' "
train['clean_tweet']=train.clean_tweet.apply(lambda tweet: remove_specialchars(str(tweet),"[^a-zA-Z'#]"))
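
Because the pattern keeps only letters, apostrophes and `#`, digits are replaced by spaces as well. The sketch below applies the same pattern to the cleaned text of row 9 as shown in the output above. Note that `#gr8` becomes `#gr`, so digit-bearing abbreviations such as 'gr8', 'b4' or 'h8' no longer exist by the time the shortcut dictionary in the preprocessing step is applied.

```python
import re

pattern = "[^a-zA-Z'#]"  # same pattern as in the cell above
sample = "welcome here ! i'm it's so #gr8 !"  # row 9 after handle removal

cleaned = re.sub(pattern, ' ', sample)
tokens = cleaned.split()
print(tokens)
```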

In [9]:
train.head(20)

Unnamed: 0,id,label,tweet,clean_tweet
0,1,0,@user when a father is dysfunctional and is s...,when a father is dysfunctional and is so sel...
1,2,0,@user @user thanks for #lyft credit i can't us...,thanks for #lyft credit i can't use cause th...
2,3,0,bihday your majesty,bihday your majesty
3,4,0,#model i love u take with u all the time in ...,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation,factsguide society now #motivation
5,6,0,[2/2] huge fan fare and big talking before the...,huge fan fare and big talking before the...
6,7,0,@user camping tomorrow @user @user @user @use...,camping tomorrow danny
7,8,0,the next school year is the year for exams.ð...,the next school year is the year for exams ...
8,9,0,we won!!! love the land!!! #allin #cavs #champ...,we won love the land #allin #cavs #champ...
9,10,0,@user @user welcome here ! i'm it's so #gr...,welcome here i'm it's so #gr


#### Data preprocessing

In [10]:
# Tokenizing
train.clean_tweet=train.clean_tweet.apply(lambda tweet: tweet.split())

In [11]:
train.head()

Unnamed: 0,id,label,tweet,clean_tweet
0,1,0,@user when a father is dysfunctional and is s...,"[when, a, father, is, dysfunctional, and, is, ..."
1,2,0,@user @user thanks for #lyft credit i can't us...,"[thanks, for, #lyft, credit, i, can't, use, ca..."
2,3,0,bihday your majesty,"[bihday, your, majesty]"
3,4,0,#model i love u take with u all the time in ...,"[#model, i, love, u, take, with, u, all, the, ..."
4,5,0,factsguide: society now #motivation,"[factsguide, society, now, #motivation]"


In [12]:
# Notice that there are no upper case letters in tweets.

In [13]:
# Replace shortcuts with actual words
shortcuts = {'u': 'you', 'y': 'why', 'r': 'are', 'doin': 'doing', 'hw': 'how', 'k': 'okay', 'm': 'am', 'b4': 'before',
            'idc': "i do not care", 'ty': 'thankyou', 'wlcm': 'welcome', 'bc': 'because', '<3': 'love', 'xoxo': 'love',
            'ttyl': 'talk to you later', 'gr8': 'great', 'bday': 'birthday', 'awsm': 'awesome', 'gud': 'good', 'h8': 'hate',
            'lv': 'love', 'dm': 'direct message', 'rt': 'retweet', 'wtf': 'hate', 'idgaf': 'hate',
             'irl': 'in real life', 'yolo': 'you only live once', 'ur': 'your'}
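def replace_shortcuts(text):
    # Swap each token that is a known shortcut for its expansion; other tokens pass through.
    # Keys containing digits or symbols ('b4', 'gr8', 'h8', '<3') cannot match at this point,
    # because the earlier cleaning step already replaced those characters with spaces.
    return [shortcuts.get(token,token) for token in text]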

train.clean_tweet=train.clean_tweet.apply(lambda tweet: replace_shortcuts(tweet))

In [14]:
train.head(20)

Unnamed: 0,id,label,tweet,clean_tweet
0,1,0,@user when a father is dysfunctional and is s...,"[when, a, father, is, dysfunctional, and, is, ..."
1,2,0,@user @user thanks for #lyft credit i can't us...,"[thanks, for, #lyft, credit, i, can't, use, ca..."
2,3,0,bihday your majesty,"[bihday, your, majesty]"
3,4,0,#model i love u take with u all the time in ...,"[#model, i, love, you, take, with, you, all, t..."
4,5,0,factsguide: society now #motivation,"[factsguide, society, now, #motivation]"
5,6,0,[2/2] huge fan fare and big talking before the...,"[huge, fan, fare, and, big, talking, before, t..."
6,7,0,@user camping tomorrow @user @user @user @use...,"[camping, tomorrow, danny]"
7,8,0,the next school year is the year for exams.ð...,"[the, next, school, year, is, the, year, for, ..."
8,9,0,we won!!! love the land!!! #allin #cavs #champ...,"[we, won, love, the, land, #allin, #cavs, #cha..."
9,10,0,@user @user welcome here ! i'm it's so #gr...,"[welcome, here, i'm, it's, so, #gr]"


In [15]:
# Remove stop words
from nltk.corpus import stopwords
stop=stopwords.words('english')
train.clean_tweet=train.clean_tweet.apply(lambda tweet: [token for token in tweet if token not in stop])

In [16]:
train.head(20)

Unnamed: 0,id,label,tweet,clean_tweet
0,1,0,@user when a father is dysfunctional and is s...,"[father, dysfunctional, selfish, drags, kids, ..."
1,2,0,@user @user thanks for #lyft credit i can't us...,"[thanks, #lyft, credit, can't, use, cause, off..."
2,3,0,bihday your majesty,"[bihday, majesty]"
3,4,0,#model i love u take with u all the time in ...,"[#model, love, take, time]"
4,5,0,factsguide: society now #motivation,"[factsguide, society, #motivation]"
5,6,0,[2/2] huge fan fare and big talking before the...,"[huge, fan, fare, big, talking, leave, chaos, ..."
6,7,0,@user camping tomorrow @user @user @user @use...,"[camping, tomorrow, danny]"
7,8,0,the next school year is the year for exams.ð...,"[next, school, year, year, exams, can't, think..."
8,9,0,we won!!! love the land!!! #allin #cavs #champ...,"[love, land, #allin, #cavs, #champions, #cleve..."
9,10,0,@user @user welcome here ! i'm it's so #gr...,"[welcome, i'm, #gr]"


In [17]:
# Lemmatization
from nltk.stem import WordNetLemmatizer
lemtzr=WordNetLemmatizer()
train.clean_tweet=train.clean_tweet.apply(lambda tweet: [lemtzr.lemmatize(token) for token in tweet])

In [18]:
train.head(20)

Unnamed: 0,id,label,tweet,clean_tweet
0,1,0,@user when a father is dysfunctional and is s...,"[father, dysfunctional, selfish, drag, kid, dy..."
1,2,0,@user @user thanks for #lyft credit i can't us...,"[thanks, #lyft, credit, can't, use, cause, off..."
2,3,0,bihday your majesty,"[bihday, majesty]"
3,4,0,#model i love u take with u all the time in ...,"[#model, love, take, time]"
4,5,0,factsguide: society now #motivation,"[factsguide, society, #motivation]"
5,6,0,[2/2] huge fan fare and big talking before the...,"[huge, fan, fare, big, talking, leave, chaos, ..."
6,7,0,@user camping tomorrow @user @user @user @use...,"[camping, tomorrow, danny]"
7,8,0,the next school year is the year for exams.ð...,"[next, school, year, year, exam, can't, think,..."
8,9,0,we won!!! love the land!!! #allin #cavs #champ...,"[love, land, #allin, #cavs, #champions, #cleve..."
9,10,0,@user @user welcome here ! i'm it's so #gr...,"[welcome, i'm, #gr]"


In [19]:
def hashtag_extract(text):
    tweet=(" ").join(text)
    hashtags=re.findall(r"#(\w+)",tweet)
    return hashtags

# Extract hashtags to a new column
train['hashtags']=train.clean_tweet.apply(lambda tweet: hashtag_extract(tweet))

In [20]:
def remove_hash(text,pattern):
    for i,token in enumerate(text):
        word=re.sub(pattern,'',token)
        text[i]=word
    return text

# Remove # from tweet
train['clean_tweet']=train.clean_tweet.apply(lambda tweet: remove_hash(tweet,'#'))
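
The two hashtag steps can be checked on a small token list. This is a standalone sketch with made-up tokens; it applies the same regular expressions inline rather than calling the helper functions defined above.

```python
import re

tokens = ["thanks", "#lyft", "credit", "#getthanked"]  # made-up token list

# Step 1: collect the words that follow a '#'
hashtags = re.findall(r"#(\w+)", " ".join(tokens))

# Step 2: strip the '#' character but keep the word itself in the tweet
without_hash = [re.sub("#", "", token) for token in tokens]

print(hashtags)
print(without_hash)
```

The hashtag words therefore stay in `clean_tweet` as ordinary tokens and also appear in the separate `hashtags` column.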

In [21]:
train.head(20)

Unnamed: 0,id,label,tweet,clean_tweet,hashtags
0,1,0,@user when a father is dysfunctional and is s...,"[father, dysfunctional, selfish, drag, kid, dy...",[run]
1,2,0,@user @user thanks for #lyft credit i can't us...,"[thanks, lyft, credit, can't, use, cause, offe...","[lyft, disapointed, getthanked]"
2,3,0,bihday your majesty,"[bihday, majesty]",[]
3,4,0,#model i love u take with u all the time in ...,"[model, love, take, time]",[model]
4,5,0,factsguide: society now #motivation,"[factsguide, society, motivation]",[motivation]
5,6,0,[2/2] huge fan fare and big talking before the...,"[huge, fan, fare, big, talking, leave, chaos, ...",[allshowandnogo]
6,7,0,@user camping tomorrow @user @user @user @use...,"[camping, tomorrow, danny]",[]
7,8,0,the next school year is the year for exams.ð...,"[next, school, year, year, exam, can't, think,...","[school, exams, hate, imagine, actorslife, rev..."
8,9,0,we won!!! love the land!!! #allin #cavs #champ...,"[love, land, allin, cavs, champions, cleveland...","[allin, cavs, champions, cleveland, clevelandc..."
9,10,0,@user @user welcome here ! i'm it's so #gr...,"[welcome, i'm, gr]",[gr]


In [22]:
# Top few hate and non-hate words
def join_tokens(text):
    return (' '.join(text))

all_nonhate_tweets=' '.join(train['clean_tweet'][train['label']==0].apply(lambda tweet: join_tokens(tweet)))
all_hate_tweets=' '.join(train['clean_tweet'][train['label']==1].apply(lambda tweet: join_tokens(tweet)))

from wordcloud import WordCloud
import matplotlib.pyplot as plt
nonhate_wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_nonhate_tweets)
plt.figure(figsize=(10, 7))
plt.imshow(nonhate_wordcloud, interpolation="bilinear")
plt.axis('off')
plt.title("Most used non-hate words")
plt.show()

hate_wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_hate_tweets)
plt.figure(figsize=(10, 7))
plt.imshow(hate_wordcloud, interpolation="bilinear")
plt.axis('off')
plt.title("Most used hate words")
plt.show()

<Figure size 1000x700 with 1 Axes>

<Figure size 1000x700 with 1 Axes>

In [23]:
from heapq import nlargest 
nlargest(10, nonhate_wordcloud.words_, key = nonhate_wordcloud.words_.get)

['love',
 'amp',
 'day',
 'today',
 'happy',
 'life',
 'new',
 'time',
 'people',
 'smile']

In [24]:
from heapq import nlargest 
nlargest(10, hate_wordcloud.words_, key = hate_wordcloud.words_.get)

['amp',
 'trump',
 'white',
 'woman',
 'allahsoil',
 'black',
 'hate',
 'racist',
 'might libtard',
 'libtard libtard']

#### Feature Extraction

In [25]:
train['docs']=train['clean_tweet'].apply(lambda tweet: join_tokens(tweet))

In [26]:
train.head()

Unnamed: 0,id,label,tweet,clean_tweet,hashtags,docs
0,1,0,@user when a father is dysfunctional and is s...,"[father, dysfunctional, selfish, drag, kid, dy...",[run],father dysfunctional selfish drag kid dysfunct...
1,2,0,@user @user thanks for #lyft credit i can't us...,"[thanks, lyft, credit, can't, use, cause, offe...","[lyft, disapointed, getthanked]",thanks lyft credit can't use cause offer wheel...
2,3,0,bihday your majesty,"[bihday, majesty]",[],bihday majesty
3,4,0,#model i love u take with u all the time in ...,"[model, love, take, time]",[model],model love take time
4,5,0,factsguide: society now #motivation,"[factsguide, society, motivation]",[motivation],factsguide society motivation


In [27]:
docs=train.docs.tolist()

In [29]:
# Use CountVectorizer to extract features
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

# tokenize and build vocab
vector=vectorizer.fit_transform(docs)

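# Convert the sparse count matrix to a dense dataframe
# Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
dense=vector.todense()
df=pd.DataFrame(dense,columns=vectorizer.get_feature_names())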
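
To make concrete what the resulting dataframe holds, here is a small pure-Python sketch of a bag-of-words count matrix. It is not scikit-learn's implementation; the documents are made up, and the `tokenize` helper only approximates `CountVectorizer`'s default token pattern (runs of two or more word characters, which is why "can't" contributes only "can").

```python
import re
from collections import Counter

docs = ["model love take time", "love land love", "can't use lyft"]  # made-up documents

def tokenize(doc):
    # Approximation of the default token pattern: two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# One column per distinct token, in sorted order
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

# One row per document, holding the count of each vocabulary term
rows = []
for doc in docs:
    counts = Counter(tokenize(doc))
    rows.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
```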

#### Model Building

In [30]:
# Train, validation split
from sklearn.model_selection import train_test_split
train_x, val_x, train_y, val_y = train_test_split(df, train.label, test_size=0.20, random_state=12)

#### Logistic Regression

In [44]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score
lreg = LogisticRegression()
lreg.fit(train_x, train_y)
pred_y = lreg.predict(val_x)
print('Accuracy:', accuracy_score(val_y, pred_y))
print("F1 Score: ", f1_score(val_y, pred_y))
# Predictions using non-default threshold value
pred_y = lreg.predict_proba(val_x)
prediction_int1 = pred_y[:,1] >= 0.27
prediction_int1 = prediction_int1.astype(int)
print('Accuracy:', accuracy_score(val_y, prediction_int1))
print("F1 Score: ", f1_score(val_y, prediction_int1))

Accuracy: 0.9615204129516659
F1 Score:  0.6684636118598383
Accuracy: 0.9613639918661036
F1 Score:  0.7177142857142857
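
Lowering the threshold from 0.5 to 0.27 trades a little accuracy for a higher F1 on the minority class. The notebook does not say how 0.27 was chosen; one common approach is to sweep candidate thresholds on the validation split and keep the one with the best F1. The sketch below illustrates that idea in plain Python with made-up labels and probabilities, so the numbers are not the notebook's results.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 for the positive class when predicting 1 if score >= threshold."""
    tp = fp = fn = 0
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 1:
            fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Made-up validation labels and predicted probabilities of class 1
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
scores = [0.05, 0.10, 0.20, 0.30, 0.15, 0.02, 0.80, 0.35, 0.28, 0.45]

candidates = [0.1, 0.2, 0.27, 0.3, 0.4, 0.5]
f1_by_threshold = {t: f1_at_threshold(y_true, scores, t) for t in candidates}
best_threshold = max(f1_by_threshold, key=f1_by_threshold.get)

for t in candidates:
    print(f"threshold={t:.2f} f1={f1_by_threshold[t]:.3f}")
print("best threshold:", best_threshold)
```

A threshold tuned this way is fitted to the validation split, so the validation F1 it produces is an optimistic estimate of test performance.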


#### Linear SVC

In [33]:
from sklearn.svm import LinearSVC
model_svc = LinearSVC(penalty='l2', C=0.5, dual=False, random_state=0, max_iter=1000)
print(model_svc)
model_svc.fit(train_x,train_y)
pred_y = model_svc.predict(val_x)
print('Accuracy:', accuracy_score(val_y, pred_y))
print("F1 Score: ", f1_score(val_y, pred_y))

LinearSVC(C=0.5, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=0, tol=0.0001,
          verbose=0)
Accuracy: 0.9638667292351009
F1 Score:  0.7130434782608697


#### Naive Bayes

In [34]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(train_x, train_y)
pred_y = nb.predict_proba(val_x)
prediction_int = pred_y[:,1] >= 0.5
prediction_int = prediction_int.astype(int)
f1_score(val_y, prediction_int)

0.38125802310654683

#### Preparing test data

In [45]:
del train, docs, train_x, train_y, val_x, val_y
import gc
gc.collect()

72

In [47]:
test = pd.read_csv('test_tweets.csv')

In [48]:
test.head()

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...
2,31965,safe ways to heal your #acne!! #altwaystohe...
3,31966,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew..."


In [49]:
# Remove twitter handles
test['clean_tweet']=test.tweet.apply(lambda tweet: remove_handles(str(tweet),r"@[\w]*"))

In [50]:
test.head()

Unnamed: 0,id,tweet,clean_tweet
0,31963,#studiolife #aislife #requires #passion #dedic...,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...,#white #supremacists want everyone to see th...
2,31965,safe ways to heal your #acne!! #altwaystohe...,safe ways to heal your #acne!! #altwaystohe...
3,31966,is the hp and the cursed child book up for res...,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew...","3rd #bihday to my amazing, hilarious #nephew..."


In [51]:
# Remove special characters,numbers and punctuations except " # " and " ' "
test['clean_tweet']=test.clean_tweet.apply(lambda tweet: remove_specialchars(str(tweet),"[^a-zA-Z'#]"))

In [52]:
test.head()

Unnamed: 0,id,tweet,clean_tweet
0,31963,#studiolife #aislife #requires #passion #dedic...,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...,#white #supremacists want everyone to see th...
2,31965,safe ways to heal your #acne!! #altwaystohe...,safe ways to heal your #acne #altwaystohe...
3,31966,is the hp and the cursed child book up for res...,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew...",rd #bihday to my amazing hilarious #nephew...


In [53]:
# Tokenizing
test.clean_tweet=test.clean_tweet.apply(lambda tweet: tweet.split())

In [54]:
test.head()

Unnamed: 0,id,tweet,clean_tweet
0,31963,#studiolife #aislife #requires #passion #dedic...,"[#studiolife, #aislife, #requires, #passion, #..."
1,31964,@user #white #supremacists want everyone to s...,"[#white, #supremacists, want, everyone, to, se..."
2,31965,safe ways to heal your #acne!! #altwaystohe...,"[safe, ways, to, heal, your, #acne, #altwaysto..."
3,31966,is the hp and the cursed child book up for res...,"[is, the, hp, and, the, cursed, child, book, u..."
4,31967,"3rd #bihday to my amazing, hilarious #nephew...","[rd, #bihday, to, my, amazing, hilarious, #nep..."


In [56]:
# Notice that there are no upper case letters in tweets.

In [57]:
# Replace shortcuts with actual words
test.clean_tweet=test.clean_tweet.apply(lambda tweet: replace_shortcuts(tweet))

In [58]:
test.head()

Unnamed: 0,id,tweet,clean_tweet
0,31963,#studiolife #aislife #requires #passion #dedic...,"[#studiolife, #aislife, #requires, #passion, #..."
1,31964,@user #white #supremacists want everyone to s...,"[#white, #supremacists, want, everyone, to, se..."
2,31965,safe ways to heal your #acne!! #altwaystohe...,"[safe, ways, to, heal, your, #acne, #altwaysto..."
3,31966,is the hp and the cursed child book up for res...,"[is, the, hp, and, the, cursed, child, book, u..."
4,31967,"3rd #bihday to my amazing, hilarious #nephew...","[rd, #bihday, to, my, amazing, hilarious, #nep..."


In [59]:
# Remove stop words
test.clean_tweet=test.clean_tweet.apply(lambda tweet: [token for token in tweet if token not in stop])

In [60]:
test.head()

Unnamed: 0,id,tweet,clean_tweet
0,31963,#studiolife #aislife #requires #passion #dedic...,"[#studiolife, #aislife, #requires, #passion, #..."
1,31964,@user #white #supremacists want everyone to s...,"[#white, #supremacists, want, everyone, see, n..."
2,31965,safe ways to heal your #acne!! #altwaystohe...,"[safe, ways, heal, #acne, #altwaystoheal, #hea..."
3,31966,is the hp and the cursed child book up for res...,"[hp, cursed, child, book, reservations, alread..."
4,31967,"3rd #bihday to my amazing, hilarious #nephew...","[rd, #bihday, amazing, hilarious, #nephew, eli..."


In [61]:
# Lemmatization
test.clean_tweet=test.clean_tweet.apply(lambda tweet: [lemtzr.lemmatize(token) for token in tweet])

In [62]:
test.head()

Unnamed: 0,id,tweet,clean_tweet
0,31963,#studiolife #aislife #requires #passion #dedic...,"[#studiolife, #aislife, #requires, #passion, #..."
1,31964,@user #white #supremacists want everyone to s...,"[#white, #supremacists, want, everyone, see, n..."
2,31965,safe ways to heal your #acne!! #altwaystohe...,"[safe, way, heal, #acne, #altwaystoheal, #heal..."
3,31966,is the hp and the cursed child book up for res...,"[hp, cursed, child, book, reservation, already..."
4,31967,"3rd #bihday to my amazing, hilarious #nephew...","[rd, #bihday, amazing, hilarious, #nephew, eli..."


In [63]:
# Extract hashtags to a new column
test['hashtags']=test.clean_tweet.apply(lambda tweet: hashtag_extract(tweet))

In [64]:
# Remove # from tweet
test['clean_tweet']=test.clean_tweet.apply(lambda tweet: remove_hash(tweet,'#'))

In [65]:
test.head()

Unnamed: 0,id,tweet,clean_tweet,hashtags
0,31963,#studiolife #aislife #requires #passion #dedic...,"[studiolife, aislife, requires, passion, dedic...","[studiolife, aislife, requires, passion, dedic..."
1,31964,@user #white #supremacists want everyone to s...,"[white, supremacists, want, everyone, see, new...","[white, supremacists, birds, movie]"
2,31965,safe ways to heal your #acne!! #altwaystohe...,"[safe, way, heal, acne, altwaystoheal, healthy...","[acne, altwaystoheal, healthy, healing]"
3,31966,is the hp and the cursed child book up for res...,"[hp, cursed, child, book, reservation, already...","[harrypotter, pottermore, favorite]"
4,31967,"3rd #bihday to my amazing, hilarious #nephew...","[rd, bihday, amazing, hilarious, nephew, eli, ...","[bihday, nephew]"


In [66]:
test['docs']=test['clean_tweet'].apply(lambda tweet: join_tokens(tweet))

In [67]:
test.head()

Unnamed: 0,id,tweet,clean_tweet,hashtags,docs
0,31963,#studiolife #aislife #requires #passion #dedic...,"[studiolife, aislife, requires, passion, dedic...","[studiolife, aislife, requires, passion, dedic...",studiolife aislife requires passion dedication...
1,31964,@user #white #supremacists want everyone to s...,"[white, supremacists, want, everyone, see, new...","[white, supremacists, birds, movie]",white supremacists want everyone see new birds...
2,31965,safe ways to heal your #acne!! #altwaystohe...,"[safe, way, heal, acne, altwaystoheal, healthy...","[acne, altwaystoheal, healthy, healing]",safe way heal acne altwaystoheal healthy healing
3,31966,is the hp and the cursed child book up for res...,"[hp, cursed, child, book, reservation, already...","[harrypotter, pottermore, favorite]",hp cursed child book reservation already yes h...
4,31967,"3rd #bihday to my amazing, hilarious #nephew...","[rd, bihday, amazing, hilarious, nephew, eli, ...","[bihday, nephew]",rd bihday amazing hilarious nephew eli ahmir u...


In [68]:
test_docs=test.docs.tolist()

In [69]:
# transform count vectorizer on test
test_vector=vectorizer.transform(test_docs)
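
Calling `transform` (rather than `fit_transform`) reuses the vocabulary learned from the training tweets, so the test matrix has the same columns as the training matrix, and words that never appeared in training are simply dropped. A minimal pure-Python sketch of that behaviour, with a hypothetical three-word vocabulary and a made-up test document:

```python
from collections import Counter

train_vocabulary = ["love", "model", "time"]  # hypothetical vocabulary learned from training
test_doc = "love zumba time time"             # made-up test document

counts = Counter(test_doc.split())

# One count per training-vocabulary term, in the training column order
row = [counts[term] for term in train_vocabulary]

# Tokens with no matching column contribute nothing to the row
ignored = sorted(set(counts) - set(train_vocabulary))

print(row)
print(ignored)
```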

In [70]:
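# Convert the sparse count matrix to a dense dataframe
# Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
test_dense=test_vector.todense()
test_df=pd.DataFrame(test_dense,columns=vectorizer.get_feature_names())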

In [71]:
test_df.head()

Unnamed: 0,aa,aaa,aaaaa,aaaaaand,aaaaah,aaaaand,aaahh,aaahhhh,aaahhhhh,aaand,...,zulu,zuma,zumba,zurich,zx,zydeco,zz,zzz,zzzzzz,zzzzzzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Making predictions on test data with best model.

In [72]:
submission = pd.read_csv('Submission.csv')

In [73]:
submission.head()

Unnamed: 0,id,label
0,31963,0
1,31964,0
2,31965,0
3,31966,0
4,31967,0


In [76]:
test_pred_y = lreg.predict_proba(test_df)
test_prediction_int1 = test_pred_y[:,1] >= 0.27
submission['label'] = test_prediction_int1.astype(int)

In [78]:
submission.to_csv('Submission_lreg.csv', index= False)