In this lab, we will
- read our project data into a Pandas DataFrame
- write a function to compute simple features for each row of the data frame
- fit a LogisticRegression model to the data
- print the top coefficients
- compute measures of accuracy

I've given you starter code below. You should:
- First, try to get it to work with your data. It may require changing the `load_data` function to match the requirements of your data (e.g., what is the object you are classifying -- a tweet, a user, a news article?)
- Second, you should add additional features to the `make_features` function:
  - Be creative. It could be additional word features, or other metadata about the user, date, etc.
- As you try out different feature combinations, print out the coefficients and accuracy scores
- List any features that seem to improve accuracy. Why do you think that is?
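
As one possible starting point for the second step, here is a minimal sketch of a per-row feature function that combines the tracked word counts with a few non-word features. The helper name `extra_features` and the feature names `n_tokens`, `n_mentions`, `n_exclaim`, and `frac_upper` are illustrative assumptions, not part of the starter code, and nothing here says these features will actually help on your data.

```python
import re
from collections import Counter

def extra_features(text, words_to_track=('you', 'hate', 'love')):
    """Return a feature dict for one document.

    Hypothetical helper: the name and the extra features are illustrative only.
    """
    # Same tokenization as the starter code: lowercase, split on non-word characters.
    tokens = re.sub(r'\W+', ' ', text.lower()).split()
    token_counts = Counter(tokens)
    features = {w: token_counts[w] for w in words_to_track}
    # A few simple non-word features computed from the raw text.
    features['n_tokens'] = len(tokens)
    features['n_mentions'] = len(re.findall(r'@\w+', text))
    features['n_exclaim'] = text.count('!')
    features['frac_upper'] = sum(c.isupper() for c in text) / max(len(text), 1)
    return features

print(extra_features('@someone I LOVE you, you know!'))
```

Inside `make_features`, the inner loop over `words_to_track` could then be swapped for `features = extra_features(row['text'])`; `DictVectorizer` creates one column per key it sees.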

In [33]:
from collections import Counter
import numpy as np
import pandas as pd
import re
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer

In [41]:
def load_data(datafile):
    """
    Read your data into a single pandas dataframe where
    - each row is an instance to be classified
    (this could be a tweet, user, or news article, depending on your project)
    - there is a column called `label` which stores the class label (e.g., the true
      category for this row)
    """
    df = pd.read_csv(datafile)[['text', 'hostile']]
    df.columns = ['text', 'label']
    df['label'] = ['hostile' if i==1 else 'nonhostile' for i in df.label]
    return df

df = load_data('~/Dropbox/elevate/harassment/training_data/data.csv.gz')
df.head()

Unnamed: 0,text,label
0,@FlyGuyCree Nigga whatever one you gave me 🤦🏻‍♀️,nonhostile
1,@ArvindKejriwal . go to hell you ass hole.,hostile
2,@JohnJohnDaDon That “nigga” done lost his fuck...,hostile
3,@kane_tingle10 Can’t be fucked with them mate....,nonhostile
4,@JHarris_TheDon Its honestly better anyways. T...,nonhostile


In [42]:
# what is the distribution over class labels?
df.label.value_counts()

hostile       3588
nonhostile    3186
Name: label, dtype: int64

In [43]:
def make_features(df):
    vec = DictVectorizer()
    feature_dicts = []
    # just as an initial example, we will consider three
    # word features in the model.
    words_to_track = ['you', 'hate', 'love']
    for i, row in df.iterrows():
        features = {}
        # raw string so that \W reaches the regex engine as written
        token_counts = Counter(re.sub(r'\W+', ' ', row['text'].lower()).split())
        for w in words_to_track:
            features[w] = token_counts[w]
        feature_dicts.append(features)
    X = vec.fit_transform(feature_dicts)
    return X, vec
                
X, vec = make_features(df)

In [44]:
# what are the dimensions of the feature matrix? (rows = instances, columns = features)
X.shape

(6774, 3)

In [45]:
# what are the feature names?
# vocabulary_ is a dict from feature name to column index
vec.vocabulary_

{'you': 2, 'hate': 0, 'love': 1}

In [46]:
# how often does each word occur?
for word, idx in vec.vocabulary_.items():
    print('%20s\t%d' % (word, X[:,idx].sum()))

                 you	2622
                hate	44
                love	129


In [59]:
# can also get a simple list of feature names:
# (newer scikit-learn releases replace this method with get_feature_names_out())
vec.get_feature_names()
# e.g., first column is 'hate', second is 'love', etc.

['hate', 'love', 'you']

In [47]:
# we'll first store the classes separately in a numpy array
y = np.array(df.label)
Counter(y)

Counter({'nonhostile': 3186, 'hostile': 3588})

In [58]:
# to find the row indices with hostile label
np.where(y=='hostile')[0]

array([   1,    2,    5, ..., 6769, 6771, 6773])

In [49]:
# store the class names
class_names = set(df.label)

In [51]:
# how often does each word appear in each class?
for word, idx in vec.vocabulary_.items():
    for class_name in class_names:
        class_idx = np.where(y==class_name)[0]
        print('%20s\t%20s\t%d' % (word, class_name, X[class_idx, idx].sum()))

                 you	             hostile	1690
                 you	          nonhostile	932
                hate	             hostile	20
                hate	          nonhostile	24
                love	             hostile	44
                love	          nonhostile	85


So, `you` appears more frequently in the positive (hostile) class, and `love` appears more frequently in the negative (non-hostile) class. `hate` is rare and occurs about equally often in both classes.
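
Because the two classes differ in size, it can also help to compare counts per row rather than raw totals. The sketch below simply divides the counts printed above by the class sizes from `value_counts()`; the dictionary names are illustrative assumptions.

```python
# Raw counts copied from the outputs above.
class_sizes = {'hostile': 3588, 'nonhostile': 3186}
word_counts = {
    'you':  {'hostile': 1690, 'nonhostile': 932},
    'hate': {'hostile': 20,   'nonhostile': 24},
    'love': {'hostile': 44,   'nonhostile': 85},
}

# Average number of occurrences per row, within each class.
rates = {
    word: {c: n / class_sizes[c] for c, n in by_class.items()}
    for word, by_class in word_counts.items()
}

for word, by_class in rates.items():
    for class_name, rate in by_class.items():
        print('%20s\t%20s\t%.3f' % (word, class_name, rate))
```

The direction of each comparison stays the same here, but with more imbalanced classes the normalized rates can tell a different story than the raw counts.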

In [67]:
# fit a LogisticRegression classifier.
clf = LogisticRegression(solver='lbfgs', multi_class='auto')
clf.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='auto',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [69]:
# for binary classification, LogisticRegression stores a single coefficient vector
clf.coef_
# this would be a matrix with one row per class for a multi-class problem.

array([[ 0.29751405,  0.83802711, -0.31400639]])

In [90]:
# for binary classification, clf.coef_[0] holds the coefficients for clf.classes_[1] ('nonhostile' here);
# the coefficients for clf.classes_[0] ('hostile') are just the negation of that vector.
coef = [-clf.coef_[0], clf.coef_[0]]
print(coef)

[array([-0.29751405, -0.83802711,  0.31400639]), array([ 0.29751405,  0.83802711, -0.31400639])]
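
To see what these numbers mean, here is a rough sketch of how logistic regression turns a feature vector into a probability: a weighted sum of the features is passed through the sigmoid function. The coefficients are copied from the output above. The intercept is an assumption: the fitted `clf.intercept_` is not shown in this notebook, so `0.0` is used purely as a placeholder and the printed probabilities are illustrative only.

```python
import math

# Coefficients printed above, in column order ['hate', 'love', 'you'].
# In scikit-learn's binary case they score clf.classes_[1], here 'nonhostile'.
coef_nonhostile = {'hate': 0.29751405, 'love': 0.83802711, 'you': -0.31400639}
intercept = 0.0  # ASSUMPTION: placeholder; the fitted clf.intercept_ is not shown above

def prob_nonhostile(features):
    """Sigmoid of the weighted feature sum; features maps word -> count."""
    z = intercept + sum(coef_nonhostile[w] * features.get(w, 0) for w in coef_nonhostile)
    return 1.0 / (1.0 + math.exp(-z))

for feats in [{'love': 1}, {'you': 2}]:
    p = prob_nonhostile(feats)
    print(feats, 'P(nonhostile)=%.2f' % p, 'P(hostile)=%.2f' % (1 - p))
```

A positive coefficient pushes the probability toward `nonhostile` and a negative one toward `hostile`, which is why negating the vector gives the view from the other class.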


In [91]:
for ci, class_name in enumerate(clf.classes_):
    print('coefficients for %s' % class_name)
    display(pd.DataFrame([coef[ci]], columns=vec.get_feature_names()))

coefficients for hostile


Unnamed: 0,hate,love,you
0,-0.297514,-0.838027,0.314006


coefficients for nonhostile


Unnamed: 0,hate,love,you
0,0.297514,0.838027,-0.314006


In [92]:
# sort coefficients by class.
features = vec.get_feature_names()
for ci, class_name in enumerate(clf.classes_):
    print('top features for class %s' % class_name)
    for fi in coef[ci].argsort()[::-1]: # descending order.
        print('%20s\t%.2f' % (features[fi], coef[ci][fi]))

top features for class hostile
                 you	0.31
                hate	-0.30
                love	-0.84
top features for class nonhostile
                love	0.84
                hate	0.30
                 you	-0.31


In [101]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

kf = KFold(n_splits=5, shuffle=True, random_state=42)
accuracies = []
for train, test in kf.split(X):
    clf.fit(X[train], y[train])
    pred = clf.predict(X[test])
    accuracies.append(accuracy_score(y[test], pred))
    
    
print('accuracy over all cross-validation folds: %s' % str(accuracies))
print('mean=%.2f std=%.2f' % (np.mean(accuracies), np.std(accuracies)))

accuracy over all cross-validation folds: [0.5424354243542435, 0.5476014760147602, 0.5276752767527675, 0.5402214022140222, 0.5236336779911374]
mean=0.54 std=0.01
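
A useful reference point for these scores is the accuracy of always predicting the most common class. The sketch below computes that baseline from the class counts printed earlier; the variable names are illustrative, and the baseline within any single fold may differ slightly from the overall figure.

```python
# Class counts copied from the value_counts() output above.
class_counts = {'hostile': 3588, 'nonhostile': 3186}

majority_class = max(class_counts, key=class_counts.get)
baseline = class_counts[majority_class] / sum(class_counts.values())
print('majority class: %s  baseline accuracy: %.2f' % (majority_class, baseline))
```

With only three word features, the mean accuracy of 0.54 is about one point above this baseline, so there is plenty of room for the additional features to help.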
