## Exploratory Data Analysis and Sample ML Models on the Twitter Dataset

The dataset is obtained from https://www.kaggle.com/darkknight98/twitter-data. The objective is to determine which gender commits more typos on Twitter, based on approximately 20,000 user tweets and other tweet- and user-related metadata.

### Read on to see boilerplate implementations of different ML algorithms applied to the dataset. I've provided implementations that use just the text and implementations that use no text at all. Models tested include an ensemble of simple classifiers, logistic regression, Random Forest, SVC and MultinomialNB, along with a basic neural network.

# Brief Contents


* [Understanding the data](#under)
* [Nulls Checking & Removal](#nulls)
* [Regexp based text preprocessing](#regexp)
* [Encoding Cat-Features](#encoding)
* [Data Visualizations](#dv)
* [Gender Analysis with Just Text](#text)
* [Gender Analysis without Text](#notext)
    - [Ensembling](#ensembling)
    - [Simple NN](#nn)

P.S.: The notebook also covers regexp-based cleaning and vectorization of the tweet text, so that the text becomes a usable feature.

<a id="under"></a>
## Understanding the Data

###  Importing the required modules 

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Loading the dataset with suitable encoding

In [None]:
df=pd.read_csv("/kaggle/input/tweet_data.csv",encoding='ISO-8859-1')

In [None]:
df

### Dropping the columns which are of less significance.

In [None]:
col=["profileimage","tweet_location","user_timezone","sidebar_color","tweet_coord","link_color","fav_number","tweet_id","_last_judgment_at","created","tweet_created"]
df.drop(col,axis=1,inplace=True)

In [None]:
df
## Sanity Check ##

### Removing duplicates from the dataframe

In [None]:
df.drop_duplicates(inplace=True)

<a id="nulls"></a>
## Nulls Checking and removal

### Checking for null values across the columns

In [None]:
df.isnull().sum()

### Dropping rows with null values in the gender and gender:confidence columns, since gender is the dependent variable and such rows cannot be used

In [None]:
c=["gender","gender:confidence"]
df.dropna(subset=c,how="any",inplace=True)

In [None]:
df

### Checking null values that still exist in the entire dataframe

In [None]:
df.isnull().sum()

### Replacing null values in text column with empty string

In [None]:
df["text"] = df["text"].fillna("")  # assign back instead of chained inplace fill

### Replacing null values in description column with empty string

In [None]:
df["description"] = df["description"].fillna("")  # assign back instead of chained inplace fill

<a id="regexp"></a>
# Reg-exp based text preprocessing

### Removing the special characters and hyperlinks in text column using Regular Expressions

In [None]:
import re

# Strip t.co short links (http and https) in one vectorized pass
df["text"] = df["text"].astype(str).str.replace(r"https?://t\.co/[a-zA-Z0-9]*", " ", regex=True)

# Replace special characters with spaces.
# Note: inside the character class, " -@" is a range (space through "@"),
# so digits and most ASCII punctuation are replaced as well.
df["text"] = df["text"].str.replace(r"[,!.; -@!%^&*)(]", " ", regex=True)

### Removing the special characters and hyperlinks in description column using regular Expressions

In [None]:
# Same cleaning as above, applied to the description column
df["description"] = df["description"].astype(str).str.replace(r"https?://t\.co/[a-zA-Z0-9]*", " ", regex=True)
df["description"] = df["description"].str.replace(r"[,!.; -@!%^&*)(]", " ", regex=True)
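
To see what the link-stripping step does in isolation, here is a minimal sketch on a few made-up strings. The sample strings and the name `LINK_PATTERN` are assumptions for illustration only and do not come from the dataset; the pattern simply folds the `http` and `https` variants into one expression.

```python
import re

# Hypothetical sample strings, not taken from the dataset.
samples = [
    "Check this out http://t.co/abc123 now!",
    "New post: https://t.co/XyZ789",
    "No links here",
]

# One pattern covers both http and https t.co short links.
LINK_PATTERN = re.compile(r"https?://t\.co/[a-zA-Z0-9]*")

# Replace each link with a single space, leaving the rest of the text alone.
cleaned = [LINK_PATTERN.sub(" ", s) for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Strings without a link pass through unchanged, which makes the step safe to apply to every row.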

In [None]:
df.head(10)

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.columns

### Finding the number of unique values of columns

In [None]:
df.gender.nunique()

In [None]:
df.gender.unique()

In [None]:
df._golden.unique()

In [None]:
df._unit_state.unique()

<a id="encoding"></a>
## Encoding the Cat-Features

### Converting Categorical data to Numerical data

In [None]:
num_col=df.select_dtypes(include=np.number).columns
print("Numerical Columns :\n",num_col)
cat_col=df.select_dtypes(exclude=np.number).columns
print("Categorical Columns :\n",cat_col)

In [None]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
df['gender']=le.fit_transform(df['gender'])#Converts brand=0,female=1,male=2,unknown=3
df['_golden']=le.fit_transform(df['_golden'])#Converts true as 1 and false as 0
df['_unit_state']=le.fit_transform(df['_unit_state'])#converts finalized as 0 and golden as 1
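
The integer codes mentioned in the comments above follow from `LabelEncoder` sorting the class labels before numbering them. A small sketch, using a hypothetical list of labels rather than the real column, shows how the mapping can be read back from `classes_`:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels in arbitrary order, mirroring the four gender values.
labels = ["male", "brand", "unknown", "female", "male"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, so the position of each class is its integer code.
mapping = {cls: int(code) for code, cls in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
print(encoder.inverse_transform(codes))
```

Because each call to `fit_transform` refits the encoder, reusing one encoder for several columns keeps only the last mapping; a separate encoder per column would keep each mapping available.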

In [None]:
df.gender.unique()

In [None]:
df._golden.unique()

In [None]:
df._unit_state.unique()

In [None]:
df.head(10)

In [None]:
df.tail()

<a id="dv"></a>
## Data Visualization

### Histogram of gender 

In [None]:
df.gender.plot(kind='hist')



### Bar plot against gender and tweet count

In [None]:
fig=plt.figure(figsize=(10,5))
plt.bar(df.gender,df.tweet_count,color='maroon',width=0.4)
plt.xlabel("Gender")
plt.ylabel("Tweet count")
plt.title("Tweet count based on gender")
plt.show()
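
Since `plt.bar` above receives one bar per row, bars for the same gender are drawn on top of each other. An aggregated view may be easier to read. The sketch below uses a hypothetical toy frame (the values are invented) and summarizes tweet count per gender code with `groupby`; the resulting table could then be passed to a bar plot.

```python
import pandas as pd

# Hypothetical rows; gender uses the integer codes from the encoding step.
toy = pd.DataFrame({
    "gender": [0, 1, 1, 2, 2, 2],
    "tweet_count": [100, 20, 40, 10, 30, 50],
})

# Mean and median tweet count per gender code.
summary = toy.groupby("gender")["tweet_count"].agg(["mean", "median"])
print(summary)
```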

###  Seaborn heatmap for a correlation matrix

In [None]:
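# restrict to numeric columns so corr() also works on recent pandas versions
sns.heatmap(df.select_dtypes(include=np.number).corr(),annot=True,fmt='.1g',cbar=False)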

In [None]:
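# compute the numeric-only correlation once, then mask its upper triangle
corr=df.select_dtypes(include=np.number).corr()
matrix=np.triu(corr)
sns.heatmap(corr,annot=True,mask=matrix)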

<a id="text"></a>
## Gender Analysis with just Text

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
df.info()

In [None]:
def normalize_text(s):
    # just in case
    s = str(s)
    s = s.lower()
    
    # remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
    # (raw strings keep the regex escapes intact)
    s = re.sub(r'\s\W',' ',s)
    s = re.sub(r'\W\s',' ',s)
    
    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+',' ',s)
    
    return s
df['text_norm'] = [normalize_text(s) for s in df['text']]
df['description_norm'] = [normalize_text(s) for s in df['description']]

In [None]:
df['all_features'] = df['text_norm'].str.cat(df['description_norm'], sep=' ')
df_confident = df[df['gender:confidence']==1] ## Keep only the rows labelled with full confidence
df_confident.shape # Now we have approx. 14000 entries.

### This is where we vectorize the tweets such that they are now a candidate for a feature

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(df_confident['text_norm'])
encoder = LabelEncoder()
y = encoder.fit_transform(df_confident['gender'])
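
To make the vectorization step concrete, the following sketch runs `CountVectorizer` on a hypothetical three-sentence corpus (the sentences are invented and only stand in for the normalized tweets). Each row of the result is a document and each column counts one vocabulary word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the normalized tweets.
corpus = ["good morning twitter", "good night", "morning coffee and twitter"]

vec = CountVectorizer()
counts = vec.fit_transform(corpus)  # sparse matrix: rows = documents, columns = words

print(sorted(vec.vocabulary_))      # learned vocabulary
print(counts.toarray())             # dense view is fine at this toy size
```

On the real data the matrix is kept sparse, since a dense array with one column per distinct word would be very large.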

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from catboost import CatBoostClassifier
#nb = CatBoostClassifier(silent = True)

#### Multinomial Naive Bayes was found to give reasonably good performance on this data ####

nb = MultinomialNB(alpha = 0.6,fit_prior = True)
nb.fit(x_train, y_train)

print(nb.score(x_test, y_test))

Accuracies between 54% and 59% can be obtained with suitable tuning.
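
One possible way to do such tuning is a small grid search over the smoothing parameter `alpha`. The sketch below is only an illustration under stated assumptions: the toy texts, labels and candidate `alpha` values are made up, and the scores it prints say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus and labels, only to make the sketch runnable.
texts = [
    "love this game", "great match today", "new shoes sale", "sale on dresses",
    "match was great", "love the new dresses", "game night today", "shoes and dresses",
]
labels = [0, 0, 1, 1, 0, 1, 0, 1]

features = CountVectorizer().fit_transform(texts)

# Try a few smoothing values with 2-fold cross-validation.
search = GridSearchCV(
    MultinomialNB(),
    param_grid={"alpha": [0.1, 0.3, 0.6, 1.0]},
    cv=2,
    scoring="accuracy",
)
search.fit(features, labels)

print(search.best_params_, search.best_score_)
```

With the notebook's data, `features` and `labels` would be replaced by `x_train` and `y_train`, and a larger `cv` value would give more stable estimates.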

In [None]:
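### Just to illustrate what the vectorized text data looks like ###
# x is a SciPy sparse matrix, so build a sparse-backed DataFrame from it
df_just_text = pd.DataFrame.sparse.from_spmatrix(x)
df_just_text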

<a id="notext"></a>
## Gender Analysis without Text

### Analysis using important features without text

The following section examines predictive modelling of typos even without using the 'Text' of the tweet. Let's see how linear models and GBDTs perform in this scenario.

In [None]:
X=df[['_unit_id','_golden','_unit_state','_trusted_judgments','gender:confidence','profile_yn:confidence','retweet_count','tweet_count']]

In [None]:
X.info()

In [None]:
df.select_dtypes(include=np.number).corr()  # numeric columns only

In [None]:
Y=df[['gender']]

In [None]:
df_conf = df[df['gender:confidence']==1]
df_conf.shape

In [None]:
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size = 0.2)

<a id="ensembling"></a>
## Ensembling

I have used the following code in several of my notebooks, and I suggest using it as a template for other ensembling tasks to save time. It is a boilerplate, naive (non-optimized) ensemble of roughly 20 algorithms.

In [None]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process,model_selection
import xgboost
from xgboost import XGBClassifier
MLA = [
    #Ensemble Methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),

    #Gaussian Processes
    #gaussian_process.GaussianProcessClassifier(),
    
    #GLM
    linear_model.LogisticRegressionCV(),
    linear_model.PassiveAggressiveClassifier(),
    linear_model.RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    linear_model.Perceptron(),
    
    #Naive Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),
    
    #Nearest Neighbor
    neighbors.KNeighborsClassifier(),
    
    #SVM
    #svm.SVC(probability=True),
    #svm.NuSVC(probability=True),
    svm.LinearSVC(),
    
    #Trees    
    tree.DecisionTreeClassifier(),
    tree.ExtraTreeClassifier(),
    
    #Discriminant Analysis
    discriminant_analysis.LinearDiscriminantAnalysis(),
    discriminant_analysis.QuadraticDiscriminantAnalysis(),

    
    #xgboost
    XGBClassifier()    
    ]

cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0 ) # run model 10x with 60/30 split intentionally leaving out 10%
MLA_columns = ['MLA Name', 'MLA Parameters', 'MLA Test Accuracy Mean', 'MLA Test Accuracy 3*STD' ,'MLA Time']
MLA_compare = pd.DataFrame(columns = MLA_columns)
MLA_predict = Y.copy()  # DataFrame copy, so each model's predictions can be added as a new column
row_index = 0
X1 = X.copy()
for alg in MLA:
    #print(row_index)
    X = X1
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'MLA Name'] = MLA_name
    print('Examining ',MLA_name)
    MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())
    cv_results = model_selection.cross_validate(alg, X, Y, cv  = cv_split)
    MLA_compare.loc[row_index, 'MLA Time'] = cv_results['fit_time'].mean()
    MLA_compare.loc[row_index, 'MLA Test Accuracy Mean'] = cv_results['test_score'].mean()   
    MLA_compare.loc[row_index, 'MLA Test Accuracy 3*STD'] = cv_results['test_score'].std()*3
    alg.fit(X, Y)
    MLA_predict[MLA_name] = alg.predict(X)
    row_index+=1
MLA_compare.sort_values(by = ['MLA Test Accuracy Mean'], ascending = False, inplace = True)
MLA_compare

Although the accuracies are lower than those of the models that use text data (which is to be expected), GradientBoostingClassifier and XGBoost seem to do best on this data!
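
The cross-validation in the cell above uses `ShuffleSplit` with `train_size = .6` and `test_size = .3`, so each of the 10 repetitions deliberately leaves part of the data unused. A minimal sketch on 100 placeholder rows (an assumption chosen only to make the counts easy to read) shows the resulting split sizes:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# 100 placeholder rows; only the row count matters here.
rows = np.arange(100).reshape(-1, 1)

splitter = ShuffleSplit(n_splits=10, test_size=0.3, train_size=0.6, random_state=0)

# Collect (train size, test size) for every repetition.
sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in splitter.split(rows)]
unused = len(rows) - sum(sizes[0])

print("train/test sizes per split:", sizes[0])
print("rows left out of each split:", unused)
```

Each repetition draws a fresh random split, so a row left out in one repetition can still appear in the train or test part of another.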

In [None]:
#### Taking 4 Ensembles ###
import warnings
warnings.filterwarnings('ignore')
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process,model_selection
import xgboost
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
MLA = [
    #Ensemble Methods
    ensemble.AdaBoostClassifier(),
    ensemble.GradientBoostingClassifier(),
    XGBClassifier(),
    CatBoostClassifier(verbose = False)    ## Just to see how it does! ##
    ]

cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0 ) # run model 10x with 60/30 split intentionally leaving out 10%
MLA_columns = ['MLA Name', 'MLA Parameters', 'MLA Test Accuracy Mean', 'MLA Test Accuracy 3*STD' ,'MLA Time']
MLA_compare = pd.DataFrame(columns = MLA_columns)
MLA_predict = Y.copy()  # DataFrame copy, so each model's predictions can be added as a new column
row_index = 0
X1 = X.copy()
for alg in MLA:
    X = X1
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'MLA Name'] = MLA_name
    print(MLA_name)
    MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())
    cv_results = model_selection.cross_validate(alg, X, Y, cv  = cv_split)
    MLA_compare.loc[row_index, 'MLA Time'] = cv_results['fit_time'].mean()
    MLA_compare.loc[row_index, 'MLA Test Accuracy Mean'] = cv_results['test_score'].mean()   
    MLA_compare.loc[row_index, 'MLA Test Accuracy 3*STD'] = cv_results['test_score'].std()*3
    alg.fit(X, Y)
    MLA_predict[MLA_name] = alg.predict(X)
    row_index+=1
MLA_compare.sort_values(by = ['MLA Test Accuracy Mean'], ascending = False, inplace = True)
MLA_compare

In [None]:
#barplot using https://seaborn.pydata.org/generated/seaborn.barplot.html
sns.barplot(x='MLA Test Accuracy Mean', y = 'MLA Name', data = MLA_compare, color = 'm')

#prettify using pyplot: https://matplotlib.org/api/pyplot_api.html
plt.title('Machine Learning Algorithm Accuracy Score \n')
plt.xlabel('Accuracy Score')  # values are fractions between 0 and 1, not percentages
plt.ylabel('Algorithm')

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

Uncomment and run whichever you feel like!

In [None]:
#a=XGBClassifier(n_estimators = 150,max_depth = 3,random_state=100)
#a=ensemble.GradientBoostingClassifier(n_estimators = 150,min_samples_leaf = 10,max_depth = 3,random_state=100)
#a=CatBoostClassifier(iterations = 150,depth = 3,random_seed=100,verbose = False)
#a=ensemble.AdaBoostClassifier(n_estimators = 150,random_state=100)

a = MultinomialNB()

In [None]:
a.fit(X_train,Y_train)

In [None]:
y_pred=a.predict(X_test)

In [None]:
score=accuracy_score(Y_test,y_pred)
score*100

<a id="nn"></a>
## Just trying out a simple NN for completeness

In [None]:
import tensorflow as tf

In [None]:
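num_classes = int(Y['gender'].max()) + 1  # gender is label-encoded as 0..3

model=tf.keras.Sequential([
    tf.keras.layers.Dense(units=8,input_dim=X_train.shape[1],activation='relu'),
    tf.keras.layers.LeakyReLU(0.3),
    # one output unit per gender class, with softmax for multi-class prediction
    tf.keras.layers.Dense(units=num_classes,activation='softmax')
])

In [None]:
# integer class labels pair with sparse categorical cross-entropy
model.compile(loss = 'sparse_categorical_crossentropy',optimizer = 'adam',metrics = ['accuracy'])

In [None]:
model.fit(X_train,Y_train,epochs=5)

In [None]:
# pick the most probable class for each row
y_pred=model.predict(X_test).argmax(axis=1)

In [None]:
score=accuracy_score(Y_test,y_pred)
score*100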

Well, a 2-layer NN? Not that great, I guess! There's still scope for improving it, but I'll stop here, as I feel I have covered enough for you to take it from here.

## So we have seen several implementations for analysing Twitter data with different algorithms, both using the text and without it (using just the metadata). Hope it was helpful!