Crowdflower's [post](https://www.crowdflower.com/using-machine-learning-to-predict-gender/) on this dataset is pretty lacking in details about what kind of model they used to predict Twitter user gender. All they say about it is "we ran the tweets through our AI feature", and that they achieved about 60% accuracy on their three-way (male, female, brand/organization) classification task.

Let's see how well we can do in a quick run-through.

I'm going to crib a lot of code from [my notebook on classifying types of news](https://www.kaggle.com/kinguistics/d/uciml/news-aggregator-dataset/classifying-news-headlines-with-scikit-learn).

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# we'll want this for plotting
import matplotlib.pyplot as plt

# we'll want this for text manipulation
import re

# for quick and dirty counting
from collections import defaultdict

# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder

# have to use latin1 even though it results in a lot of dead characters
twigen = pd.read_csv("gender-classifier-DFE-791531.csv", encoding='latin1')
twigen.head()

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,gender,gender:confidence,profile_yn,profile_yn:confidence,created,...,profileimage,retweet_count,sidebar_color,text,tweet_coord,tweet_count,tweet_created,tweet_id,tweet_location,user_timezone
0,815719226,False,finalized,3,10/26/15 23:24,male,1.0,yes,1.0,12/5/13 1:48,...,https://pbs.twimg.com/profile_images/414342229...,0,FFFFFF,Robbie E Responds To Critics After Win Against...,,110964,10/26/15 12:40,6.5873e+17,main; @Kan1shk3,Chennai
1,815719227,False,finalized,3,10/26/15 23:30,male,1.0,yes,1.0,10/1/12 13:51,...,https://pbs.twimg.com/profile_images/539604221...,0,C0DEED,ÛÏIt felt like they were my friends and I was...,,7471,10/26/15 12:40,6.5873e+17,,Eastern Time (US & Canada)
2,815719228,False,finalized,3,10/26/15 23:33,male,0.6625,yes,1.0,11/28/14 11:30,...,https://pbs.twimg.com/profile_images/657330418...,1,C0DEED,i absolutely adore when louis starts the songs...,,5617,10/26/15 12:40,6.5873e+17,clcncl,Belgrade
3,815719229,False,finalized,3,10/26/15 23:10,male,1.0,yes,1.0,6/11/09 22:39,...,https://pbs.twimg.com/profile_images/259703936...,0,C0DEED,Hi @JordanSpieth - Looking at the url - do you...,,1693,10/26/15 12:40,6.5873e+17,"Palo Alto, CA",Pacific Time (US & Canada)
4,815719230,False,finalized,3,10/27/15 1:15,female,1.0,yes,1.0,4/16/14 13:23,...,https://pbs.twimg.com/profile_images/564094871...,0,0,Watching Neighbours on Sky+ catching up with t...,,31462,10/26/15 12:40,6.5873e+17,,


In [2]:
def normalize_text(s):
    # just in case
    s = str(s)
    s = s.lower()
    
    # remove punctuation next to whitespace; word-internal punctuation
    # (e.g., hyphens, apostrophes) is kept
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    
    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+', ' ', s)
    
    return s

twigen['text_norm'] = [normalize_text(s) for s in twigen['text']]
twigen['description_norm'] = [normalize_text(s) for s in twigen['description']]
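
To see what `normalize_text` does in practice, here is a small self-contained check on a made-up tweet (the sample string is an assumption, not a row from the dataset). The function is repeated so the snippet runs by itself.

```python
import re

def normalize_text(s):
    # same steps as the cell above
    s = str(s).lower()
    # drop a non-word character that directly follows or precedes whitespace
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # collapse any runs of whitespace into a single space
    s = re.sub(r'\s+', ' ', s)
    return s

sample = "Hello, World! It's a well-known fact."
normalized = normalize_text(sample)
print(normalized)
```

The word-internal apostrophe and hyphen survive, while the comma and exclamation mark that sit before a space are dropped. Punctuation at the very end of the string stays, because no whitespace follows it.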


Let's grab some info about the gold standard and about the dataset's confidence in its gender classifications so we have some idea of what would be good to train on.

In [3]:

# how many observations are gold standard?
gold_values = defaultdict(int)
for val in twigen._golden:
    gold_values[val] += 1
print(gold_values)

# what does the confidence look like?
print(np.any(np.isnan(twigen['gender:confidence'])))
# we've got at least one NaN, so let's remove
gender_confidence = twigen['gender:confidence'].dropna()
print(len(gender_confidence))
# filter with a boolean mask; positions from np.where are not index labels,
# so indexing the filtered Series with them raises a KeyError
gender_nonones = gender_confidence[gender_confidence < 1]
print(len(gender_nonones))

defaultdict(<class 'int'>, {False: 20000, True: 50})
True
20024

About 30% of the observations have less than 100% confidence in the gender classification, so we'll ignore those.

In [None]:
twigen_confident = twigen[twigen['gender:confidence']==1]
twigen_confident.shape
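
The confidence checks and the filter above can be sketched on a toy frame. The values below are made up for illustration and only the two column names come from the dataset.

```python
import numpy as np
import pandas as pd

# toy stand-in for twigen; the values are invented
toy = pd.DataFrame({
    'gender': ['male', 'female', 'brand', 'female', 'male'],
    'gender:confidence': [1.0, 0.66, 1.0, np.nan, 1.0],
})

conf = toy['gender:confidence']
has_nan = bool(conf.isna().any())            # is any confidence missing?
known = conf.dropna()                        # keep rows that have a confidence value
share_below_one = float((known < 1).mean())  # fraction with confidence below 1

# a boolean mask keeps only rows labelled with full confidence;
# NaN == 1 evaluates to False, so rows with missing confidence drop out too
toy_confident = toy[toy['gender:confidence'] == 1]

print(has_nan, len(known), share_below_one, toy_confident.shape)
```

Taking the mean of a boolean Series gives the share of `True` values, which is a quick way to get a figure like the "about 30%" mentioned above.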

Okay, now let's see how well a Naive Bayes classifier can do by just looking at the words in the randomly chosen tweet.

In [4]:
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(twigen_confident['text_norm'])

encoder = LabelEncoder()
y = encoder.fit_transform(twigen_confident['gender'])

# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# take a look at the shape of each of these
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
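
For a feel of what the vectorizer and the encoder produce, here is a minimal sketch on a made-up three-document corpus. The documents are assumptions for illustration, and only the three label names come from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# invented mini corpus, one short "tweet" per label
docs = ["i love this song", "love my new phone", "new offers every day"]
labels = ["male", "female", "brand"]

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(docs)      # sparse matrix: rows = documents, columns = words
vocab = sorted(vectorizer.vocabulary_)  # the learned words, alphabetically

encoder = LabelEncoder()
y = encoder.fit_transform(labels)       # labels mapped to integers in sorted order

print(x.shape)
print(vocab)
print(list(encoder.classes_), list(y))
```

With its default settings, `CountVectorizer` ignores single-character tokens, so "i" never makes it into the vocabulary.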



Alright, let's make the classifier.

In [5]:
nb = MultinomialNB()
nb.fit(x_train, y_train)

print(nb.score(x_test, y_test))


So we get about 58% accuracy on the "best" observations, using only tweet text.

Let's try a couple more features. Specifically, let's add the description text by concatenating it to the tweet text.

In [6]:
twigen['all_features'] = twigen['text_norm'].str.cat(twigen['description_norm'], sep=' ')

twigen_confident = twigen[twigen['gender:confidence']==1]
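
A small sketch of the concatenation on made-up rows follows. The toy values are assumptions, and a plain `str(s).lower()` stands in for `normalize_text` to keep the snippet short.

```python
import pandas as pd

# invented rows; the second profile has no description
toy = pd.DataFrame({
    'text_norm': ['great game tonight', 'new deals every day'],
    'description': ['runner and dad', float('nan')],
})

# normalize_text calls str(s) first, so a missing description becomes the string 'nan'
toy['description_norm'] = [str(s).lower() for s in toy['description']]

# join the two text columns row by row, separated by a space
toy['all_features'] = toy['text_norm'].str.cat(toy['description_norm'], sep=' ')
all_features = toy['all_features'].tolist()
print(all_features)
```

One side effect worth knowing: profiles without a description contribute the literal word "nan" as a feature. Filling missing descriptions with an empty string before normalizing would be one way to avoid that, if it turns out to matter.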


In [7]:
# pull the data into vectors, this time using tweet text plus description
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(twigen_confident['all_features'])

encoder = LabelEncoder()
y = encoder.fit_transform(twigen_confident['gender'])

# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

nb = MultinomialNB()
nb.fit(x_train, y_train)

print(nb.score(x_test, y_test))


Cool, so we gain about 2-3 percentage points in accuracy just by adding description text alongside tweet text.

You can use this kind of procedure to play around with adding more features, or try a different type of model and see how accurately you can predict gender. (Maybe also try including the less-confident observations; excluding them was probably anti-conservative, since it leaves only the "best" cases.)
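
As one possible way to compare model types, here is a hedged sketch that runs two classifiers through the same cross-validated pipeline. The texts and the topic labels are made-up stand-ins, and logistic regression is just one example of "a different type of model", not something the original analysis used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# invented stand-in data; swap in twigen_confident['all_features'] and ['gender']
texts = [
    "great goal in the match tonight", "the team won the final game",
    "training for the marathon this weekend", "what a match and what a goal",
    "baking fresh bread this morning", "best pasta recipe ever",
    "cooking dinner with fresh basil", "this soup recipe is amazing",
    "shop our new deals today", "follow us for daily offers",
    "check out our latest products", "free shipping on all orders today",
]
labels = ["sports"] * 4 + ["food"] * 4 + ["retail"] * 4

models = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    # the pipeline refits the vectorizer inside each fold,
    # so the vocabulary is learned from the training part only
    pipeline = make_pipeline(CountVectorizer(), model)
    scores[name] = cross_val_score(pipeline, texts, labels, cv=2).mean()
    print(name, round(scores[name], 3))
```

Averaging over folds gives a steadier number than a single random 80/20 split, which helps when the differences between setups are only a few percentage points.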