This is a fork of the script [here](https://www.kaggle.com/kinguistics/d/crowdflower/twitter-user-gender-classification/classifying-user-gender-based-on-tweet-text) - I wanted to look at some other models, such as logistic regression and tree-based models. I also wanted to look at the coefficients of the logistic regression to see which words it finds predictive of each gender.
--------------------------------------------------------------------------------------------

Crowdflower's [post](https://www.crowdflower.com/using-machine-learning-to-predict-gender/) on this dataset is pretty lacking in details about what kind of model they used to predict Twitter user gender. All they say about it is "we ran the tweets through our AI feature", and that they achieved about 60% accuracy on their three-way (male, female, brand/organization) classification task.

Let's see how well we can do in a quick run-through.

I'm going to crib a lot of code from [my notebook on classifying types of news](https://www.kaggle.com/kinguistics/d/uciml/news-aggregator-dataset/classifying-news-headlines-with-scikit-learn).

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# we'll want this for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# we'll want this for text manipulation
import re

# for quick and dirty counting
from collections import defaultdict

# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder

# have to use latin1 even though it results in a lot of dead characters
twigen = pd.read_csv("../input/gender-classifier-DFE-791531.csv", encoding='latin1')
twigen.head()

In [None]:
def normalize_text(s):
    # just in case
    s = str(s)
    s = s.lower()
    
    # remove punctuation that touches whitespace; word-internal
    # punctuation (e.g., hyphens, apostrophes) is kept
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)

    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+', ' ', s)
    
    return s

twigen['text_norm'] = [normalize_text(s) for s in twigen['text']]
twigen['description_norm'] = [normalize_text(s) for s in twigen['description']]
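
As a quick sanity check, here is what the normalization does to a made-up tweet. The sample string is hypothetical, and the function is repeated only so the snippet runs on its own. Note that punctuation at the very end of the string survives, because the patterns only strip punctuation that sits next to whitespace.

```python
import re

def normalize_text(s):
    # same steps as the cell above, repeated so this snippet is self-contained
    s = str(s).lower()
    s = re.sub(r'\s\W', ' ', s)   # punctuation right after whitespace
    s = re.sub(r'\W\s', ' ', s)   # punctuation right before whitespace
    s = re.sub(r'\s+', ' ', s)    # collapse any repeated whitespace
    return s

# hypothetical example text, not taken from the dataset
sample = "Hello, World! It's a well-known fact."
normalized = normalize_text(sample)
print(normalized)
```

The comma and exclamation mark are dropped, while the apostrophe in "it's", the hyphen in "well-known" and the final period are kept.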

In [None]:
twigen.shape

Let's grab some info about the gold standard and about the dataset's confidence in its gender classifications so we have some idea of what would be good to train on.

In [None]:

# how many observations are gold standard?
gold_values = defaultdict(int)
for val in twigen._golden:
    gold_values[val] += 1
print(gold_values)

# what does the confidence look like?
confidence = twigen['gender:confidence']
print(confidence.isna().any())
# we've got at least one NaN, so let's remove
# (boolean masks avoid mixing up row positions and index labels)
gender_confidence = confidence[confidence.notna()]
print(len(gender_confidence))
gender_nonones = gender_confidence[gender_confidence < 1]
print(len(gender_nonones))

About 30% of the observations have less than 100% confidence in the gender classification, so we'll ignore those.

In [None]:
twigen_confident = twigen[twigen['gender:confidence']==1]
twigen_confident.shape

Let's look at the distribution of the labels:

In [None]:
gender_counts= twigen_confident['gender'].value_counts()
gender_counts/sum(gender_counts)

Okay, now let's see how well a few classifiers can do by just looking at the words in each user's randomly chosen tweet.

In [None]:
# pull the data into vectors
vectorizer = TfidfVectorizer(min_df=3)
x = vectorizer.fit_transform(twigen_confident['text_norm'])

encoder = LabelEncoder()
y = encoder.fit_transform(twigen_confident['gender'])


In [None]:
encoder.classes_

In [None]:
x.shape
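
To make the vectorization step a bit more concrete, here is a minimal sketch on a made-up three-document corpus. The documents and the `toy_` names are assumptions for illustration, and `min_df=2` is used only because the corpus is tiny. Tokens that appear in fewer than `min_df` documents are dropped from the vocabulary, and the default tokenizer also ignores single-character tokens.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy corpus, not taken from the dataset
toy_docs = ["the cat sat", "the cat ran", "a dog ran"]

# keep only tokens that occur in at least 2 documents
toy_vectorizer = TfidfVectorizer(min_df=2)
toy_x = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each kept token to its column index
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(toy_x.shape)  # (number of documents, number of kept tokens)
```

Here "sat" and "dog" each occur in only one document, so they are filtered out, which is the same effect `min_df=3` has on rare tokens in the tweets.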

Let's set a random state here and stratify the validation since the classes are slightly unbalanced:

In [None]:
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.2,
                                                    stratify = y,
                                                    random_state = 4)

# take a look at the shape of each of these
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
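
Here is a small sketch of what `stratify` buys us, using made-up labels with a 60/30/10 class split (the toy arrays are assumptions, not the tweet data). With stratification, the train and test sets should both keep those proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical unbalanced labels: 60 of class 0, 30 of class 1, 10 of class 2
toy_y = np.array([0] * 60 + [1] * 30 + [2] * 10)
toy_x = np.arange(len(toy_y)).reshape(-1, 1)  # dummy single feature

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.2, stratify=toy_y, random_state=4)

# share of each class in the two splits
train_share = np.bincount(toy_y_train) / len(toy_y_train)
test_share = np.bincount(toy_y_test) / len(toy_y_test)
print(train_share)
print(test_share)
```

Without `stratify`, the small class could easily end up over- or under-represented in a 20-row test set, which would make the accuracy numbers noisier.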

Alright, let's make the classifier.

I'm defining a function that fits a given model and evaluates it on the test set, and then looping over a bunch of different models:

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

In [None]:
def eval_accuracy(model):
    """Fit the model on the training split and return its test accuracy."""
    model.fit(x_train, y_train)
    return model.score(x_test, y_test)

In [None]:
models = [LogisticRegression(),
          MultinomialNB(),
          RandomForestClassifier(n_estimators=50),
          KNeighborsClassifier()]

Note that we could have spent more time tuning the parameters of the different models, but the ones above tend to be pretty decent out of the box without any fine-tuning, so we'll just leave them as they are.
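
For reference, tuning could look something like the sketch below. This is not something this notebook runs on the tweets: the data comes from `make_classification`, and the grid of `C` values is an arbitrary assumption chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic three-class data standing in for the real features
toy_x, toy_y = make_classification(n_samples=200, n_features=20,
                                   n_informative=5, n_classes=3,
                                   random_state=4)

# hypothetical grid over the regularization strength
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validated search, scored by accuracy
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=5, scoring="accuracy")
search.fit(toy_x, toy_y)

print(search.best_params_)
print(search.best_score_)
```

On the real data the search should only ever see the training split, so that the test set stays untouched for the final comparison.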

In [None]:
results = pd.Series([eval_accuracy(model) for model in models],
                    index = ["logit", "nb", "rf", "knn"])

In [None]:
results.plot(kind = "barh", title="Accuracy by Model")

We get pretty similar results for random forests, logistic regression and Naive Bayes, with KNN giving worse results.

### A closer look at the logistic regression:

In [None]:
vectorizer = CountVectorizer(min_df=5)  # min_df=5 to keep mostly actual words
x = vectorizer.fit_transform(twigen_confident['text_norm'])

In [None]:
model = LogisticRegression()
model.fit(x, y)

In [None]:
encoder.classes_

### Highest male coefficients:

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
coeffs_male = pd.Series(model.coef_[2], index = vectorizer.get_feature_names_out())
coeffs_male.sort_values(ascending=False)[:10].plot.barh()

OK, seems reasonable enough - we've got some politics, bromance and 420-related tokens.

### Highest female coefficients:

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
coeffs_female = pd.Series(model.coef_[1], index = vectorizer.get_feature_names_out())
coeffs_female.sort_values(ascending=False).head(10).plot.barh()

Hmm, this isn't so convincing at first glance.

### Highest brand coefficients:

In [None]:
(pd.Series(model.coef_[0], index = vectorizer.get_feature_names_out())
        .sort_values(ascending=False)
        .head(10)
        .plot.barh())
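
The three cells above hard-code the row of `coef_` for each class. Here is a small sketch of looking the row up by label instead, since the rows of `coef_` follow the order of `model.classes_`. The toy documents, labels and the `top_tokens` helper are all made up for illustration, and the token list is rebuilt from `vocabulary_` so it doesn't depend on the scikit-learn version.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# hypothetical toy data: each class has its own exclusive tokens
toy_docs = ["sale offer", "offer sale discount",
            "apple pear", "pear apple",
            "stone river", "river stone"]
toy_labels = ["brand", "brand", "female", "female", "male", "male"]

toy_vectorizer = CountVectorizer()
toy_x = toy_vectorizer.fit_transform(toy_docs)
toy_model = LogisticRegression().fit(toy_x, toy_labels)

# tokens ordered by their column index in the count matrix
tokens = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

def top_tokens(model, tokens, label, n=10):
    """Return the n largest coefficients for the row matching `label`."""
    row = list(model.classes_).index(label)
    coeffs = pd.Series(model.coef_[row], index=tokens)
    return coeffs.sort_values(ascending=False).head(n)

print(top_tokens(toy_model, tokens, "male", n=2))
```

Looking the row up through `classes_` keeps the plots labelled correctly even if the set of classes changes, for example if an extra label shows up in the data.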