# Twitter Sentiment Analysis

<img src="flow_chart.png" height=200px width=800px></img>

# 01 : Frame the Problem

#### Problem Statement Link :  https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/

# 02 : Obtain Data

### Import Statements

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as ms

#regular expression 
import re 

from sklearn.model_selection import train_test_split

#generating ngrams and tokens and Bagging
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

# Sequentialization of tasks
from sklearn.pipeline import Pipeline

#optimizing parameters
from sklearn.model_selection import GridSearchCV

#different classification models being used
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier

#measuring the efficiency of our algorithms.
from sklearn.metrics import f1_score

%matplotlib inline

### Reading the Train Data

In [2]:
train = pd.read_csv('train.csv')
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 3 columns):
id       31962 non-null int64
label    31962 non-null int64
tweet    31962 non-null object
dtypes: int64(2), object(1)
memory usage: 749.2+ KB


# 03 : Analyze Data

In [3]:
train.head()

| | id | label | tweet |
|---|---|---|---|
| 0 | 1 | 0 | @user when a father is dysfunctional and is s... |
| 1 | 2 | 0 | @user @user thanks for #lyft credit i can't us... |
| 2 | 3 | 0 | bihday your majesty |
| 3 | 4 | 0 | #model i love u take with u all the time in ... |
| 4 | 5 | 0 | factsguide: society now #motivation |


In [4]:
train['label'].value_counts()

0    29720
1     2242
Name: label, dtype: int64
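
The counts above show that only 2242 of the 31962 tweets carry label 1, so the classes are heavily imbalanced. The following is a minimal sketch, using only those two counts, of why accuracy alone can mislead on such data and why the F1 score of the positive class is a more telling metric. The "always predict 0" baseline is a hypothetical illustration, not a model used in this notebook.

```python
# Class counts taken from train['label'].value_counts()
n_label_0, n_label_1 = 29720, 2242

# Hypothetical baseline: predict label 0 for every tweet.
tp, fp = 0, 0              # no tweet is ever predicted as label 1
fn, tn = n_label_1, n_label_0

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Guard against division by zero when nothing is predicted positive.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)

print("accuracy =", round(accuracy, 4))
print("f1       =", f1)
```

The baseline reaches roughly 93% accuracy while its F1 score for label 1 is zero, because it never finds a single positive tweet.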

In [5]:
train[train['label']==1]['tweet'].head()

13    @user #cnn calls #michigan middle school 'buil...
14    no comment!  in #australia   #opkillingbay #se...
17                               retweet if you agree! 
23      @user @user lumpy says i am a . prove it lumpy.
34    it's unbelievable that in the 21st century we'...
Name: tweet, dtype: object

## Label types
-   0 : Normal
-   1 : Hate

# 05 : Model Selection (1st Iteration)

<img src="supervised_flow_chart.png"></img>

## RandomForest without Preprocessing of Text Data

In [6]:
#Building the model without preprocessing of data
unprocessed_data = pd.read_csv('train.csv')

In [7]:
#splitting the data into random train and test subsets
X_train, X_test, y_train, y_test = train_test_split(unprocessed_data["tweet"],
                                                    unprocessed_data["label"],
                                                    test_size = 0.2, random_state = 42)

In [8]:
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', RandomForestClassifier(n_estimators=200)),])

In [9]:
model = text_clf.fit(X_train,y_train)

KeyboardInterrupt: 

In [None]:
predicted = model.predict(X_test)

In [None]:
from sklearn.metrics import precision_score,recall_score

In [None]:
precision_score(y_test,predicted)

In [None]:
recall_score(y_test,predicted)

In [None]:
f1_score(y_test,predicted)

# 04 and 05 : Feature Engineering and Model Selection (2nd Iteration)

Preprocessing is essential for text analysis, because a machine learning algorithm cannot consume raw text. Two steps are needed, and Scikit-Learn provides both.
The text must first be parsed into words, which is called tokenization. The words are then encoded as integers or floating point values that serve as input to the algorithm, which is called feature extraction (or vectorization).
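
As a small illustration of these two steps, the sketch below runs `CountVectorizer` and `TfidfTransformer` on a three-sentence toy corpus. The corpus and the variable names are made up for this example and are not part of the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus (hypothetical, not taken from the dataset).
docs = ["i love this phone", "i hate this phone", "love love love"]

# Tokenization + counting: every column corresponds to one vocabulary word.
# The default token pattern keeps only tokens of 2+ characters, so "i" is dropped.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

vocab = sorted(vectorizer.vocabulary_)   # columns are in alphabetical order
dense_counts = counts.toarray()

# Feature extraction: re-weight the raw counts with TF-IDF.
tfidf = TfidfTransformer().fit_transform(counts)

print(vocab)
print(dense_counts)
print(tfidf.shape)
```

Each row of `dense_counts` is one document and each column is one token, which is the numeric form a classifier can train on.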


In [10]:
#regular expression that removes @mentions and every character that is not a letter, digit, space or tab (punctuation, '#', emoticons)
def process_tweet(tweet):
    return " ".join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])", " ",tweet.lower()).split())
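
To make the effect of the regular expression concrete, here is a short sketch that applies the same function to a made-up tweet. The sample string is hypothetical, and the function is repeated only so that the snippet runs on its own.

```python
import re

def process_tweet(tweet):
    # Same cleaning rule as in the cell above: lowercase, replace @mentions and
    # any character other than letters, digits, space or tab with a space,
    # then collapse repeated whitespace.
    return " ".join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])", " ", tweet.lower()).split())

sample = "@user Thanks for #lyft credit, I can't use it!"
cleaned = process_tweet(sample)
print(cleaned)
```

Note that the hashtag text survives (only the `#` is removed) and that the apostrophe splits "can't" into "can" and "t".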

In [11]:
#Drop the given columns from a DataFrame in place
def drop_features(features,data):
    data.drop(features,inplace=True,axis=1)

In [12]:
#Applying the process_tweet function to the train data
train['processed_tweets'] = train['tweet'].apply(process_tweet)

In [13]:
train.head()

| | id | label | tweet | processed_tweets |
|---|---|---|---|---|
| 0 | 1 | 0 | @user when a father is dysfunctional and is s... | when a father is dysfunctional and is so selfi... |
| 1 | 2 | 0 | @user @user thanks for #lyft credit i can't us... | thanks for lyft credit i can t use cause they ... |
| 2 | 3 | 0 | bihday your majesty | bihday your majesty |
| 3 | 4 | 0 | #model i love u take with u all the time in ... | model i love u take with u all the time in ur |
| 4 | 5 | 0 | factsguide: society now #motivation | factsguide society now motivation |


In [14]:
train[train['label']==1].head(20)

| | id | label | tweet | processed_tweets |
|---|---|---|---|---|
| 13 | 14 | 1 | @user #cnn calls #michigan middle school 'buil... | cnn calls michigan middle school build the wal... |
| 14 | 15 | 1 | no comment! in #australia #opkillingbay #se... | no comment in australia opkillingbay seashephe... |
| 17 | 18 | 1 | retweet if you agree! | retweet if you agree |
| 23 | 24 | 1 | @user @user lumpy says i am a . prove it lumpy. | lumpy says i am a prove it lumpy |
| 34 | 35 | 1 | it's unbelievable that in the 21st century we'... | it s unbelievable that in the 21st century we ... |
| 56 | 57 | 1 | @user lets fight against #love #peace | lets fight against love peace |
| 68 | 69 | 1 | ð©the white establishment can't have blk fol... | the white establishment can t have blk folx ru... |
| 77 | 78 | 1 | @user hey, white people: you can call people '... | hey white people you can call people white by ... |
| 82 | 83 | 1 | how the #altright uses &amp; insecurity to lu... | how the altright uses amp insecurity to lure m... |
| 111 | 112 | 1 | @user i'm not interested in a #linguistics tha... | i m not interested in a linguistics that doesn... |


In [15]:
drop_features(['id','tweet'],train)

In [16]:
train.head()

| | label | processed_tweets |
|---|---|---|
| 0 | 0 | when a father is dysfunctional and is so selfi... |
| 1 | 0 | thanks for lyft credit i can t use cause they ... |
| 2 | 0 | bihday your majesty |
| 3 | 0 | model i love u take with u all the time in ur |
| 4 | 0 | factsguide society now motivation |


In [19]:
#splitting the data into random train and test subsets
x_train, x_test, y_train, y_test = train_test_split(train["processed_tweets"],train["label"],
                                                    test_size = 0.2, random_state = 42)

Pipeline : sequentially applies a list of transforms followed by a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement the fit and transform methods. The final estimator only needs to implement fit.
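
The sketch below shows this behaviour on a four-sentence toy dataset. The data, the `toy_` names and the choice of `LinearSVC` as final estimator are assumptions made for the illustration; the step names mirror the ones used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data (hypothetical): label 0 = positive wording, label 1 = negative wording.
toy_texts = ["good happy love", "love happy joy", "hate awful bad", "bad hate angry"]
toy_labels = [0, 0, 1, 1]

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),          # transform: text -> token counts
    ("tfidf", TfidfTransformer()),        # transform: counts -> TF-IDF weights
    ("clf", LinearSVC(random_state=0)),   # final estimator: only needs fit
])

# fit() fits and applies each transform in order, then fits the final
# estimator on the transformed data.
toy_pipeline.fit(toy_texts, toy_labels)

# predict() sends new text through the already fitted transforms
# (transform only) and then calls predict on the final estimator.
toy_predictions = toy_pipeline.predict(["happy love", "awful hate"])
print(toy_predictions)
```

Because the vectorizer lives inside the pipeline, the same object can later be passed raw text, for example inside a cross-validated parameter search.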

In [None]:
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', RandomForestClassifier(n_estimators=200)),])
text = text_clf.fit(x_train,y_train)

In [None]:
predicted = text.predict(x_test)

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, precision_score, recall_score

In [None]:
f1_score(y_test,predicted)

In [None]:
confusion_matrix(y_test,predicted)

In [None]:
precision_score(y_test,predicted)

In [None]:
recall_score(y_test,predicted)

# 04 and 05 : Feature Engineering and Model Selection (3rd Iteration)

In [17]:
count_vect = CountVectorizer(stop_words='english',ngram_range=(1,2),analyzer='word')
transformer = TfidfTransformer(norm='l2',sublinear_tf=True)

In [21]:
#fit() learns the vocabulary and returns the fitted CountVectorizer itself, not a count matrix
x_train_counts = count_vect.fit(x_train)
print(x_train_counts)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)


In [23]:
x_train_counts = count_vect.fit_transform(x_train)
x_train_tfidf = transformer.fit_transform(x_train_counts)
x_test_counts = count_vect.transform(x_test)
x_test_tfidf = transformer.transform(x_test_counts)

In [32]:
print(x_train_counts.shape)
print(x_train_tfidf.shape)
print(x_test_counts.shape)
print(x_test_tfidf.shape)

(25569, 155348)
(25569, 155348)
(6393, 155348)
(6393, 155348)


In [33]:
model = SGDClassifier(loss="modified_huber", penalty="l1")
model.fit(x_train_tfidf,y_train)
predictions = model.predict(x_test_tfidf)



In [34]:
f1_score(y_test,predictions)

0.6071428571428571

# 05 : Model Selection

In [35]:
#note: this model is trained on the raw count features, not on the TF-IDF features
model_svc = LinearSVC(C=2.0,max_iter=500,tol=0.0001,loss ='hinge')
model_svc.fit(x_train_counts,y_train)

LinearSVC(C=2.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=500, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0)

In [36]:
predict_svc = model_svc.predict(x_test_counts)

In [37]:
f1_score(y_test,predict_svc)

0.6830530401034929

# 06 : Tune the Model

In [38]:
params = {"tfidf__ngram_range": [(1, 2)],
          "svc__C": [.01, .1, 1, 10, 100]}

clf = Pipeline([("tfidf", TfidfVectorizer(sublinear_tf=True)),
                ("svc", LinearSVC(loss='hinge'))])

gs = GridSearchCV(clf, params, verbose=2, n_jobs=-1)
gs.fit(x_train,y_train)
print("Best Estimator = ", gs.best_estimator_)
#best_score_ is the mean cross-validated accuracy (the default scorer), not the F1 score
print("Best Score = ", gs.best_score_)

Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   59.8s finished


Best Estimator =  Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
 ...e', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0))])
Best Score =  0.9630411826821542
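
The best score reported above is a cross-validated accuracy, because `GridSearchCV` falls back to the estimator's own `score` method when no `scoring` argument is given. If the search should instead optimise the metric used elsewhere in this notebook, `scoring="f1"` can be passed. The following is a minimal sketch on hypothetical toy data; the `toy_` names, the parameter grid and `cv=3` are assumptions for the illustration, not the settings behind the result above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data (hypothetical), repeated so that every CV fold contains both classes.
toy_texts = (["love happy joy", "happy good love", "joy good happy"] * 2
             + ["hate awful bad", "bad angry hate", "awful angry bad"] * 2)
toy_labels = [0] * 6 + [1] * 6

toy_clf = Pipeline([("tfidf", TfidfVectorizer()),
                    ("svc", LinearSVC(random_state=0))])
toy_params = {"svc__C": [0.1, 1, 10]}

# scoring="f1" ranks the candidates by the F1 score of label 1
# instead of the default accuracy.
toy_gs = GridSearchCV(toy_clf, toy_params, scoring="f1", cv=3)
toy_gs.fit(toy_texts, toy_labels)

print("Best Params = ", toy_gs.best_params_)
print("Best F1     = ", toy_gs.best_score_)
```

With this setting, `best_score_` is directly comparable to the `f1_score` values computed on the held-out test split.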


In [39]:
predicted = gs.predict(x_test)

In [40]:
predicted

array([0, 0, 0, ..., 0, 0, 1], dtype=int64)

In [41]:
f1_score(y_test,predicted)

0.7245657568238213

# 07 : Predict on new cases

In [42]:
submission = pd.read_csv('test.csv')
submission.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17197 entries, 0 to 17196
Data columns (total 2 columns):
id       17197 non-null int64
tweet    17197 non-null object
dtypes: int64(1), object(1)
memory usage: 268.8+ KB


In [43]:
submission['processed_tweet'] = submission['tweet'].apply(process_tweet)

In [44]:
submission.head()

| | id | tweet | processed_tweet |
|---|---|---|---|
| 0 | 31963 | #studiolife #aislife #requires #passion #dedic... | studiolife aislife requires passion dedication... |
| 1 | 31964 | @user #white #supremacists want everyone to s... | white supremacists want everyone to see the ne... |
| 2 | 31965 | safe ways to heal your #acne!! #altwaystohe... | safe ways to heal your acne altwaystoheal heal... |
| 3 | 31966 | is the hp and the cursed child book up for res... | is the hp and the cursed child book up for res... |
| 4 | 31967 | 3rd #bihday to my amazing, hilarious #nephew... | 3rd bihday to my amazing hilarious nephew eli ... |


In [45]:
drop_features(['tweet'],submission)

In [46]:
submission.head()

| | id | processed_tweet |
|---|---|---|
| 0 | 31963 | studiolife aislife requires passion dedication... |
| 1 | 31964 | white supremacists want everyone to see the ne... |
| 2 | 31965 | safe ways to heal your acne altwaystoheal heal... |
| 3 | 31966 | is the hp and the cursed child book up for res... |
| 4 | 31967 | 3rd bihday to my amazing hilarious nephew eli ... |


In [47]:
predicted = gs.predict(submission['processed_tweet'])

In [48]:
predicted

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [49]:
final_predict = pd.DataFrame(predicted,columns=['label'])
result = pd.DataFrame(submission['id'],columns=['id'])
result = pd.concat([result,final_predict],axis=1)
result.to_csv('final_predictions.csv',index=False)

In [50]:
result['label'].value_counts()

0    16247
1      950
Name: label, dtype: int64