# <span style="color:#0b486b">  Gender Classification Based on Tweets</span>

## <span style="color:#0b486b">Sebastian Guerra </span>

### Project description

The aim of this project is to develop an algorithm that predicts a person's gender from their tweets. In the first part, NLP is used to clean the text and extract the important words. Libraries used:
<br><br>-Gensim<br>
-Spacy<br>
-Pandas<br>
-Emoji<br>

Later, machine learning models were trained with the goal of achieving the highest accuracy. The models used are:<br>

-RandomForest<br>
-SVC<br>
-A mix of Doc2Vec with SVC

Packages required to run this notebook

In [None]:
# !pip install emoji
# !pip install spacy
# !pip install pandas
# !pip install gensim
# !pip install scikit-learn
# !pip install tqdm
# !python -m spacy download en_core_web_sm

#### The following libraries are used for this Assignment

In [71]:
# for Preprocessing
import os
import csv
import xml.etree.ElementTree as ET
from itertools import chain
import spacy
import re
import pickle
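import emoji
# Note: UNICODE_EMOJI is not available in emoji 2.0 and later (emoji.EMOJI_DATA replaces it)
from emoji.unicode_codes import UNICODE_EMOJI
import pandas as pd
import multiprocessing
from tqdm import tqdm # progress bar used while parsing the XML files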

#Gensim package (for Document Vectorization and Bigram generation)
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.models import Phrases
from gensim.corpora import Dictionary

#Sklearn for model building
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

from sklearn.base import BaseEstimator
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


In [2]:
nlp = spacy.load('en_core_web_sm')
path = 'data'

In [3]:
# !unzip "data.zip"

All tweets are extracted from the `XML` files, merged into one document per author, and stored together with the author ID as `(author_id, document)` tuples for further processing

In [4]:
documents_data = []
for file_name in tqdm(os.listdir(path), desc = "Document Parsing"): # iterates through all files in the data directory
    if file_name.endswith(".xml"):
        root = ET.parse(os.path.join(path, file_name)).getroot()

        post_sentences = [] # collects the text of every tweet by this author
        for type_tag in root.findall('documents/document'):
            post_sentences.append(type_tag.text or "") # empty elements have text None

        corpus = " ".join(post_sentences) # merges all the tweets of an author into a single document
        auth_id = file_name.replace('.xml', '')
        documents_data.append((auth_id, corpus)) # the document and its author ID are added to the final list

Document Parsing: 100%|████████████████████████████████████████████████████████████| 3601/3601 [01:09<00:00, 51.97it/s]
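
To make the parsing step concrete, here is a minimal sketch on an in-memory XML string. The sample content and the `<author>` root element are assumptions made for illustration; only the `documents/document` path comes from the loop above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the layout the parsing loop assumes:
# a root element containing <documents><document>tweet</document>...</documents>
sample_xml = """<author lang="en">
  <documents>
    <document><![CDATA[First tweet]]></document>
    <document><![CDATA[Second tweet with a link https://example.com]]></document>
  </documents>
</author>"""

root = ET.fromstring(sample_xml)

# 'documents/document' is resolved relative to the root element
tweets = [node.text or "" for node in root.findall("documents/document")]

# Join with a space so the last word of one tweet does not fuse with the next
corpus = " ".join(tweets)
print(corpus)
```

The `or ""` guard matters because an empty `<document/>` element has `text` set to `None`, which `" ".join` would reject.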


The parsed tuples are converted into a data frame with one row per author, holding the `Author_ID` and the merged `Documents` text

In [5]:
documents_data = pd.DataFrame(data=documents_data, columns=["Author_ID", "Documents"])

The **Preprocessing** function accepts the document of a single user `(all of that user's tweets)` and removes
`urls`, `html entities`, `twitter users`, `punctuation`, `stopwords`, `digits`, `brackets` and `currency` symbols if any are present in the document

In [8]:
url_regex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
def Preprocessing(data):
    """
    Cleans a document, tokenizes it with spaCy and returns the lemmatized tokens
    """
    doc = re.sub(r"\n+", "", data) # removes newlines
    doc = re.sub(r"@\w*", "", doc) # removes user mentions
    doc = re.sub("&nbsp;| &amp;| &lt; | &gt;| - ", "", doc) # removes HTML entities and standalone dashes
    doc = re.sub("✔", "", doc) # removes the verified-user symbol
    doc = re.sub(url_regex, "", doc) # removes URLs
    doc = nlp(doc)
    filtered_tokens = []
    for token in doc: # spaCy tokenizes the text and exposes the attributes used for filtering
        if not (token.is_stop or token.is_punct or token.is_bracket or token.is_digit or token.is_currency):
            filtered_tokens.append(token.lemma_) # lemmatizes the token before adding it to the final list
    return filtered_tokens
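
The regex part of the cleaning can be tried in isolation, without loading spaCy. The sketch below is a simplified stand-in, not the function above: `strip_noise`, `simple_url` and the sample tweet are hypothetical names and data, and the URL pattern is deliberately much shorter than `url_regex`.

```python
import re

# Simplified URL pattern for illustration only; the notebook uses a longer one
simple_url = r"https?://\S+|www\.\S+"

def strip_noise(text):
    """Remove mentions, URLs and a few HTML entities, then collapse whitespace."""
    text = re.sub(r"@\w+", "", text)                    # Twitter handles
    text = re.sub(simple_url, "", text)                 # links
    text = re.sub(r"&(?:nbsp|amp|lt|gt);", " ", text)   # HTML entities
    return re.sub(r"\s+", " ", text).strip()            # tidy leftover whitespace

sample = "@bob check this &amp; that https://t.co/abc123\nso cool"
print(strip_noise(sample))
```

Unlike the notebook function, this sketch replaces newlines and entities with a space, which keeps neighbouring words from being glued together.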

The **postprocessing** function removes all tokens shorter than 3 characters, since these tokens add little meaning to the classification task. It also removes the whitespace tokens created during the preprocessing stage

In [9]:
def postprocessing(row):
    """
    Removes tokens shorter than 3 characters and whitespace tokens
    """
    result = []
    for token in row:
        if len(token)<3: # skips tokens shorter than 3 characters
            continue
        if re.match(r"\s+", token): # skips tokens that begin with whitespace
            continue
        result.append(token)
    return result

The preprocessing and postprocessing functions are applied to the `Documents` column. Emoji tokens are also filtered out

In [15]:
documents_data["Documents"] = documents_data["Documents"].apply(lambda x: Preprocessing(x))
documents_data["Documents"] = documents_data["Documents"].apply(lambda x: [ "" if (token in UNICODE_EMOJI) else token for token in x])
documents_data["Documents"] = documents_data["Documents"].apply(lambda x: postprocessing(x))

Training data is extracted from the Document corpus based on the `Author_ID`

In [18]:
train_labels = pd.read_csv("train_labels.csv") # the labels must be loaded before the merge
train_data = pd.merge(documents_data, train_labels, left_on="Author_ID", right_on="id", copy=False).drop(labels = ["Author_ID"], axis = 1)

In [10]:
result_tag =  {"male":1, "female":0}

In [None]:
test = pd.read_csv("test_labels.csv")
test_data = pd.merge(documents_data, test, left_on="Author_ID", right_on="id", copy=False).drop(labels = ["Author_ID"], axis = 1)
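
The merge-and-drop step can be checked on a toy example. The ids, tokens and labels below are made up; the column names follow the notebook. It is a sketch of how the default inner join behaves, not a replacement for the cells above.

```python
import pandas as pd

# Toy frames; the ids and labels are made up for illustration
docs = pd.DataFrame({
    "Author_ID": ["a1", "a2", "a3"],
    "Documents": [["good", "day"], ["new", "phone"], ["rainy"]],
})
labels = pd.DataFrame({"id": ["a1", "a3"], "gender": ["male", "female"]})

# Inner join (the default): authors without a label, such as "a2", drop out
merged = pd.merge(docs, labels, left_on="Author_ID", right_on="id")
merged = merged.drop(columns=["Author_ID"])

# Map the string labels to the 0/1 encoding used for the classifiers
result_tag = {"male": 1, "female": 0}
y = merged["gender"].map(result_tag)
print(merged)
print(y.tolist())
```

Because the join is an inner join, the same `documents_data` frame can be merged with the train labels and the test labels to obtain two disjoint subsets, provided the id lists do not overlap.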

## First model

The first model combines TF-IDF features with a Random Forest classifier


A TF-IDF vectorizer converts the token list of each document into a vector.

In [None]:
# The TfidfVectorizer ignores terms that occur in more than 95% of the documents or in fewer than 5 documents,
# builds unigrams, bigrams and trigrams, and keeps at most 1000 features.
vectorizer = TfidfVectorizer(decode_error="ignore", 
                             max_df = 0.95, 
                             min_df = 5,
                             max_features=1000,
                             token_pattern=None,
                             tokenizer = lambda x: x,
                             preprocessor = lambda x: x,
                             ngram_range=(1,3)
                             )
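
Since the documents are already token lists, the vectorizer is told to skip its own tokenization. The sketch below shows the same idea on three toy documents; `identity`, `toy_docs` and the small `min_df` are illustrative assumptions, not the notebook's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def identity(tokens):
    """Return the token list unchanged; a named function, unlike a lambda, can be pickled."""
    return tokens

toy_docs = [
    ["good", "morning", "world"],
    ["good", "night", "world"],
    ["good", "morning", "coffee"],
]

toy_vectorizer = TfidfVectorizer(
    tokenizer=identity,
    preprocessor=identity,
    token_pattern=None,   # the default pattern is unused when a tokenizer is given
    ngram_range=(1, 2),   # unigrams and bigrams
    min_df=2,             # keep terms that occur in at least 2 documents
)
toy_X = toy_vectorizer.fit_transform(toy_docs)
print(sorted(toy_vectorizer.vocabulary_))
print(toy_X.shape)
```

N-grams are built from the token list by joining neighbouring tokens with a space, so a bigram appears in the vocabulary as, for example, `good morning`.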

#### Preparing Training and Test Data

In [None]:
# Fitting the tfidf vectorizer over the training documents
X = vectorizer.fit_transform(train_data.Documents)

In [None]:
X_test = vectorizer.transform(test_data.Documents)

#### Building Models

The first model we train is a RandomForestClassifier. To find the optimal parameters for this classifier, we use grid search.

In [None]:
# Initializing Gridsearch for Randomforest
param_grid = {'n_estimators':[3500, 4000, 5000, 6000],'criterion':['gini']}

gridVectorizer = GridSearchCV(RandomForestClassifier(),param_grid,refit = True, verbose=2, cv=2)

gridVectorizer.fit(X, train_data.gender.map(result_tag))

gridVectorizer.best_params_ # best parameter combination found by the search

{'criterion': 'gini', 'n_estimators': 3500}
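
For readers who want to see the grid search mechanics without the long run time, here is a small sketch on synthetic data. The data set, the tiny `n_estimators` values and the fixed `random_state` are assumptions chosen to keep the example fast and repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF matrix and the gender labels
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

toy_grid = GridSearchCV(
    RandomForestClassifier(random_state=0),  # fixed seed for repeatable runs
    param_grid={"n_estimators": [10, 50], "criterion": ["gini"]},
    cv=2,          # same number of folds as the notebook
    refit=True,    # retrain on all data with the best setting
)
toy_grid.fit(X_toy, y_toy)

print(toy_grid.best_params_)           # the winning combination
print(round(toy_grid.best_score_, 3))  # its mean cross-validated accuracy
# Because refit=True, toy_grid.best_estimator_ is ready for predict/score
```

With `refit=True` the search object already holds a model trained on the full training set, so `best_estimator_` could be scored directly instead of fitting a new classifier by hand.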

Using the optimal parameters above, we then train a random forest model.

In [None]:
# Initializing a Random Forest model with the optimal values found after running a gridsearchcv.
clf = RandomForestClassifier(bootstrap=True, n_estimators=3500)
# clf = pickle.load(open("finalized_rf_tfidf_model.sav", 'rb'))

In [None]:
# Fitting the training dataset on the randomforest model
clf.fit(X, train_data.gender.map(result_tag))

The test accuracy of the Random Forest classifier model is

In [71]:
clf.score(X_test, test_data.gender.map(result_tag))

0.768

## Second model

The second model combines TF-IDF features with an SVC


Next, we try an SVC model. To find the optimal parameters, we run a grid search on the SVC model.

In [None]:
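# Initializing grid search for SVC
# Note: `degree` only affects the 'poly' kernel, so it has no effect on the 'linear' and 'rbf' kernels searched here
param_grid = {'C':[ 0.5, 1, 2,10],'degree':[1, 2, 3, 10], 'kernel':['linear','rbf']}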

gridSVCVectorizer = GridSearchCV(SVC(),param_grid,refit = True, verbose=2, cv=2)

gridSVCVectorizer.fit(X, train_data.gender.map(result_tag))

The best parameters from the above run are

`C`: **1**, 

`degree`: **1**, 

`kernel`: **rbf**
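
One caveat on the reported `degree`: in scikit-learn's `SVC` this parameter is only used by the polynomial kernel, so with `rbf` its value does not influence the model. The sketch below checks this on synthetic data (the data set is an assumption for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=120, n_features=10, random_state=1)

# Same kernel and C, different degree
svc_deg1 = SVC(kernel="rbf", C=1, degree=1).fit(X_demo, y_demo)
svc_deg10 = SVC(kernel="rbf", C=1, degree=10).fit(X_demo, y_demo)

# Compare the predictions of the two fitted models
same = np.array_equal(svc_deg1.predict(X_demo), svc_deg10.predict(X_demo))
print(same)
```

Dropping `degree` from the grid would therefore cut the number of fits to a quarter without changing the outcome of the search.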

Using the optimal parameters above, we then train an SVC model.

In [None]:
# Initializing a SVC model with the optimal values found after running a gridsearchcv.
model_svc = SVC(kernel="rbf", C=1, degree=1)
# model_svc = pickle.load(open("svc_tfidf_model.sav", 'rb'))
model_svc.fit(X, train_data.gender.map(result_tag))

In [25]:
model_svc.score(X_test, test_data.gender.map(result_tag))

0.77

Of the two models built on TF-IDF features (`Random Forest Classifier` and `Support Vector Classifier`), the combination of **TF-IDF and SVC** scored slightly higher (0.77 versus 0.768 test accuracy). Since the difference is marginal, we explored a different way of vectorizing the documents.

To further improve the classification accuracy, the model below uses the Doc2Vec vectorizer from the gensim package.

## Third model

#### The third model combines Doc2Vec and SVC

* The two models above relied on the built-in n-grams of TF-IDF; this approach uses gensim `Phrases` instead
* `Phrases` detects bigrams, which are then appended to the corpus of each document iteratively
* The `min_count` parameter filters out infrequent bigrams and tokens from the corpus

In [16]:
bigram = Phrases(documents_data.Documents, min_count=5) # creates bigrams from the Document Corpus
for idx in range(len(documents_data.Documents)): # iterates over all the documents
    for token in bigram[documents_data.Documents[idx]]: 
        if '_' in token: # if the token is a bigram it will add it back to the respective document from which it was generated
            documents_data.Documents[idx].append(token) 
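
The effect of appending bigram tokens can be illustrated with the standard library alone. This is a simplified stand-in and not gensim's algorithm: `Phrases` also applies a scoring function with a threshold, whereas the hypothetical `add_frequent_bigrams` below keeps every adjacent pair that reaches `min_count`.

```python
from collections import Counter

def add_frequent_bigrams(docs, min_count=2):
    """Append 'a_b' tokens for adjacent pairs seen at least min_count times."""
    pair_counts = Counter(
        pair for doc in docs for pair in zip(doc, doc[1:])
    )
    augmented = []
    for doc in docs:
        extra = [
            f"{a}_{b}" for a, b in zip(doc, doc[1:])
            if pair_counts[(a, b)] >= min_count
        ]
        augmented.append(doc + extra)  # new list; the input is left untouched
    return augmented

toy_docs = [
    ["new", "york", "pizza"],
    ["love", "new", "york"],
    ["new", "phone"],
]
print(add_frequent_bigrams(toy_docs))
```

As in the notebook, the original unigrams are kept and the joined bigram is added on top, so a document gains information rather than having tokens replaced.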

All the cleaned documents were saved to a file for further processing

In [17]:
documents_data.to_csv("cleaned_Documents.csv", index=False)

In [7]:
train_labels = pd.read_csv("train_labels.csv")

Training data is extracted from the Document corpus based on the `Author_ID`

In [18]:
train_data = pd.merge(documents_data, train_labels, left_on="Author_ID", right_on="id", copy=False).drop(labels = ["Author_ID"], axis = 1)

The Doc2Vec_Model class is an adapter that makes gensim Doc2Vec compatible with scikit-learn's GridSearchCV

In [24]:
class Doc2Vec_Model(BaseEstimator):
    """
    The gensim Doc2Vec model cannot be used directly with scikit-learn's GridSearchCV, so the Doc2Vec_Model class acts as an adapter.
    It subclasses scikit-learn's BaseEstimator, which provides the get_params and set_params methods that GridSearchCV relies on.
    """

    def __init__(self, vector_size = 100, window = 5):
        """
        In the init we are Initializing all the parameters required by the Doc2vec model 
        """
        self.epoch = 10 # number of epochs the doc2vec will train on the input data
        self.vector_size = vector_size # dimension of the output vector
        self.hs= 1 # use hierarchical softmax during training
        self.window = window # window size of doc2vec 
        self.dm = 0 #Paragraph Vector - Distributed Bag of Words

    def fit(self, tagged_documents, y=None):
        """
        This function is used to train the Doc2vec model on the input data
        """
        
        # Initialize model
        self.d2v_model = Doc2Vec(dm=self.dm, vector_size = self.vector_size, window = self.window, hs = self.hs, alpha=0.025, min_alpha=0.001)
        
        # Build vocabulary
        self.d2v_model.build_vocab([x for x in tagged_documents])
        # Model training
        self.d2v_model.train(tagged_documents, total_examples=len(tagged_documents), epochs=self.epoch)
        return self

    def transform(self, data):
        """
        This function will transform the input documents into their corresponding vectors
        """
        # infer_vector returns the document vector for each input document
        # (the number of inference passes is set with `epochs`; older gensim releases called it `steps`)
        vector = [self.d2v_model.infer_vector(doc.words, epochs=20) for doc in data]
        return vector

    def fit_transform(self, docs, y=None):
        """
        This function will run both the fit and transform steps and returns the transformed data
        """
        self.fit(docs)
        return self.transform(docs)

# Hyperparameters that GridSearchCV uses for tuning the model
param_grid = {'doc2vec__window': [3, 6, 9], # window size of doc2vec model
              'doc2vec__vector_size': [300,500], # final vector size of each document
              'svc__kernel':['linear','rbf'], # different kernels used by svc
              'svc__C':[0.3, 0.6, 0.9] # regularization for svc
                        
}

#gridsearchcv pipeline for Doc2vec and SVC
pipe_svc = Pipeline([('doc2vec', Doc2Vec_Model()), ('svc', SVC(kernel='linear'))])

# grid search over the pipeline, scored by accuracy with 2-fold cross-validation
svc_grid = GridSearchCV(pipe_svc,
                        param_grid=param_grid,
                        scoring="accuracy",
                        verbose=10, 
                        cv = 2)

result_tag =  {"male":1, "female":0} # mapping the gender with binary labels
train = []

# TaggedDocument representation of all the train data is added to the list 
# since this representation is used by Doc2vec for model building
for index, row in train_data.iterrows():
    train.append(TaggedDocument(words=row["Documents"], tags=[result_tag.get(row["gender"], -1)]))


fitted = svc_grid.fit(train, train_data.gender)

# Best parameters
print("Best Parameters: {}\n".format(svc_grid.best_params_))
print("Best accuracy: {}\n".format(svc_grid.best_score_))

Fitting 2 folds for each of 36 candidates, totalling 72 fits
[CV] doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.3, svc__kernel=linear 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.3, svc__kernel=linear, score=0.793, total= 2.2min
[CV] doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.3, svc__kernel=linear 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  2.2min remaining:    0.0s


[CV]  doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.3, svc__kernel=linear, score=0.775, total= 2.2min
[CV] doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.3, svc__kernel=rbf 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  4.4min remaining:    0.0s


[CV]  doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.3, svc__kernel=rbf, score=0.797, total= 2.2min
[CV] doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.3, svc__kernel=rbf 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  6.6min remaining:    0.0s


[CV]  doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.3, svc__kernel=rbf, score=0.778, total= 2.2min
[CV] doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.6, svc__kernel=linear 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  8.8min remaining:    0.0s


[CV]  doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.6, svc__kernel=linear, score=0.786, total= 2.3min
[CV] doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.6, svc__kernel=linear 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed: 11.1min remaining:    0.0s


[CV]  doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.6, svc__kernel=linear, score=0.771, total= 2.3min
[CV] doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.6, svc__kernel=rbf 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed: 13.4min remaining:    0.0s


[CV]  doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.6, svc__kernel=rbf, score=0.793, total= 2.3min
[CV] doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.6, svc__kernel=rbf 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed: 15.7min remaining:    0.0s


[CV]  doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.6, svc__kernel=rbf, score=0.783, total= 2.3min
[CV] doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.9, svc__kernel=linear 


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed: 18.0min remaining:    0.0s


[CV]  doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.9, svc__kernel=linear, score=0.787, total= 2.3min
[CV] doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.9, svc__kernel=linear 


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed: 20.3min remaining:    0.0s


[CV]  doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.9, svc__kernel=linear, score=0.766, total= 2.3min
[CV] doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.9, svc__kernel=rbf 
[CV]  doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.9, svc__kernel=rbf, score=0.799, total= 2.2min
[CV] doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.9, svc__kernel=rbf 
[CV]  doc2vec__vector_size=300, doc2vec__window=3, svc__C=0.9, svc__kernel=rbf, score=0.778, total= 2.3min
[CV] doc2vec__vector_size=300, doc2vec__window=6, svc__C=0.3, svc__kernel=linear 
[CV]  doc2vec__vector_size=300, doc2vec__window=6, svc__C=0.3, svc__kernel=linear, score=0.795, total= 2.3min
[CV] doc2vec__vector_size=300, doc2vec__window=6, svc__C=0.3, svc__kernel=linear 
[CV]  doc2vec__vector_size=300, doc2vec__window=6, svc__C=0.3, svc__kernel=linear, score=0.769, total= 2.3min
[CV] doc2vec__vector_size=300, doc2vec__window=6, svc__C=0.3, svc__kernel=rbf 
[CV]  doc2vec__vector_size=300, doc2vec__window=6, svc_

[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed: 196.6min finished


Best Parameters: {'doc2vec__vector_size': 500, 'doc2vec__window': 3, 'svc__C': 0.6, 'svc__kernel': 'rbf'}

Best accuracy: 0.7987096774193548
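
The adapter pattern used above rests on two scikit-learn conventions: constructor arguments stored as attributes of the same name, and the `<step>__<parameter>` naming in the parameter grid. The sketch below shows both with a hypothetical toy transformer, `FirstColumns`, on synthetic data; it stands in for the Doc2Vec adapter only to keep the example fast.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

class FirstColumns(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer that keeps the first n_keep columns."""

    def __init__(self, n_keep=5):
        # Store constructor arguments under the same name so get_params/set_params work
        self.n_keep = n_keep

    def fit(self, X, y=None):
        return self  # nothing to learn in this toy example

    def transform(self, X):
        return np.asarray(X)[:, : self.n_keep]

X_toy, y_toy = make_classification(n_samples=100, n_features=10, random_state=0)
pipe = Pipeline([("cols", FirstColumns()), ("svc", SVC())])

# '<step name>__<parameter>' routes each value to the matching pipeline step
search = GridSearchCV(
    pipe,
    param_grid={"cols__n_keep": [3, 10], "svc__C": [0.5, 1.0]},
    scoring="accuracy",
    cv=2,
)
search.fit(X_toy, y_toy)
print(search.best_params_)
```

For every candidate, GridSearchCV clones the pipeline and calls `set_params`, which is why tunable values such as `vector_size` and `window` have to be constructor arguments of the adapter.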



### Training the model with best parameters

The `final_model` was built based on the best parameters selected from the GridSearchCV

In [35]:
final_model = Doc2Vec_Model(vector_size=500, window=3)
#final_model = pickle.load(open("doc2vec_model.sav", "rb"))

In [39]:
X_train = final_model.fit_transform(train) 

In [64]:
test = pd.read_csv("test_labels.csv")
test_data = pd.merge(documents_data, test, left_on="Author_ID", right_on="id", copy=False).drop(labels = ["Author_ID"], axis = 1)

In [65]:
result_tag =  {"male":1, "female":0}
test_labs = []
for index, row in test_data.iterrows():
    test_labs.append(TaggedDocument(words=row["Documents"], tags=[result_tag.get(row["gender"], -1)]))

In [42]:
model_svc = SVC(C=0.6, kernel='rbf')
#model_svc = pickle.load(open("svc_doc2vec_model.sav", "rb"))
model_svc.fit(X_train, train_data.gender)

SVC(C=0.6, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

Transforming the test data with the trained model

In [44]:
X_test = final_model.transform(test_labs)

Train accuracy for SVC

In [43]:
model_svc.score(X_train, train_data.gender)

0.9848387096774194

Test Accuracy for SVC

In [78]:
model_svc.score(X_test, test_data.gender)

0.81

As the score above shows, the SVC model trained on Doc2Vec vectors reaches a test accuracy of 81%, higher than the roughly 77% of the TF-IDF models tried before.
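
Accuracy alone hides how each class is handled. A per-class view can be obtained with `classification_report` and `confusion_matrix`; the labels below are made up for illustration and would be replaced by `test_data.gender` and the model's predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels for illustration only
y_true = ["male", "female", "female", "male", "female", "male"]
y_pred = ["male", "female", "male", "male", "female", "female"]

# Rows are true classes, columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=["female", "male"])
print(cm)

# Precision, recall and F1 per class
print(classification_report(y_true, y_pred))
```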

All the predicted labels for the test data are saved to a `pred_labels.csv` file

In [75]:
gender = model_svc.predict(X_test)
test_data.gender = gender # overwrite the true labels with the predicted ones

In [77]:
test_data.drop(labels=["Documents"], inplace=True, axis = 1)
test_data.to_csv("pred_labels.csv", index = False)
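
A minimal sketch of the export step, assuming a frame with `id` and `gender` columns as in the test labels. The toy ids and predictions are made up, and an in-memory buffer is used in place of `pred_labels.csv` so that nothing is written to disk.

```python
import io
import pandas as pd

# Toy stand-ins for test_data and the array returned by model_svc.predict
toy_test = pd.DataFrame({"id": ["a1", "a2"], "gender": ["male", "female"]})
toy_pred = ["female", "female"]

out = toy_test[["id"]].copy()
out["gender"] = toy_pred  # store the predictions, not the true labels

buffer = io.StringIO()    # in-memory file; pass a path to write to disk
out.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Building the output frame from the `id` column keeps the true labels in `toy_test` intact, which is convenient if they are needed again for evaluation.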