In this lab, we will:
- read our project data into a Pandas DataFrame
- write a function to compute simple features for each row of the data frame
- fit a LogisticRegression model to the data
- print the top coefficients
- compute measures of accuracy

I've given you starter code below. You should:
- First, try to get it to work with your data. This may require changing the `load_data` function to match your data (e.g., what is the object you are classifying: a tweet, a user, a news article?)
- Second, you should add additional features to the make_features function:
  - Be creative. It could be additional word features, or other meta data about the user, date, etc.
- As you try out different feature combinations, print out the coefficients and accuracy scores
- List any features that seem to improve accuracy. Why do you think that is?

In [1]:
from collections import Counter
import numpy as np
import pandas as pd
import re
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer

In [33]:
def load_data(datafile, checkfile):
    """
    Read your data into a single pandas DataFrame where
    - each row is an instance to be classified
      (this could be a tweet, user, or news article, depending on your project)
    - there is a column called `label` which stores the class label (e.g., the true
      category for this row)
    """
    df = pd.read_csv(datafile)[['social_id','comment_tokens']]
    ck = pd.read_csv(checkfile)
    
    ck = ck.loc[ck['site'] == 'twitter', ['site', 'social_id', 'ruling_val']]
    
    
    ck['social_id'] = ck['social_id'].astype(df['social_id'].dtype)
    
    df.columns = ['id', 'text']
    df = df.drop_duplicates(['id','text'])
    ck.columns = ['site','id','label']
    # ck['label'] = [i.lower() for i in ck.label]
    df = pd.merge(ck,df,on=['id'],how = 'inner')
    return df

df = load_data('..\\..\\training_data\\twitter.csv.gz', '..\\..\\training_data\\factchecks.csv')
df

Unnamed: 0,site,id,label,text
0,twitter,920307162278236160,-2.0,rt @allisonpohle : o-h-i-no ! this story wasn'...
1,twitter,920464477388029952,-2.0,"rt @theellenshow : tomorrow , the first people..."
2,twitter,918272604410122240,-2.0,rt @lauraloomer : investigative journalist lau...
3,twitter,918689132137791488,-2.0,rt @lauraloomer : #jesuscampos be miss ? laura...
4,twitter,913244272643653632,-2.0,rt @tacticalpoet84 : urlref\nfollow-up from my...
5,twitter,911837232633323520,-2.0,rt @tacticalpoet84 : find out my video mean fo...
6,twitter,911652673002266624,-2.0,rt @jjmacnab : militia extremist be be agitate...
7,twitter,917083461197946880,-2.0,rt @jasonlacanfora : report about @kaepernick7...
8,twitter,917083461197946880,-2.0,rt @jasonlacanfora : report about @kaepernick7...
9,twitter,917082971030552576,-2.0,rt @jasonlacanfora : stand for anthem wasn't s...


In [3]:
# what is the distribution over class labels?
df.label.value_counts()

-2.0    650
 0.0    201
-1.0    176
 2.0    141
 1.0     89
Name: label, dtype: int64

In [4]:
def tweet_tokenizer(s):
    s = re.sub(r'#(\S+)', r'HASHTAG_\1', s)
    s = re.sub(r'@(\S+)', r'MENTION_\1', s)
    s = re.sub(r'http\S+', 'THIS_IS_A_URL', s)
    return re.sub(r'\W+', ' ', s.lower()).split()

def counters(d):
    counts = Counter()  # handy object: dict from object -> int
    counts.update(d)
    return counts

tokens = [token for tweet in df['text'] for token in tweet_tokenizer(tweet)]
counts = counters(tokens)
words_to_track = [i[0] for i in counts.most_common(70)]
words_to_track

['rt',
 'the',
 'urlref',
 'be',
 'to',
 'a',
 'in',
 'of',
 's',
 'and',
 'i',
 'it',
 'on',
 'for',
 'mention_realdonaldtrump',
 'this',
 'that',
 't',
 'have',
 'we',
 'trump',
 'wa',
 'with',
 'you',
 'at',
 'not',
 'he',
 'do',
 'ha',
 'no',
 'my',
 'u',
 'from',
 'by',
 'say',
 'they',
 'our',
 'if',
 'more',
 'people',
 'about',
 'an',
 'his',
 'will',
 'but',
 'just',
 'what',
 'would',
 'get',
 'obama',
 'go',
 'all',
 'out',
 'take',
 'one',
 'up',
 'me',
 'make',
 'than',
 'should',
 'now',
 'so',
 'president',
 'think',
 'who',
 'your',
 'or',
 'today',
 'can',
 'there']
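
To see what the tokenizer does to a single tweet, here is a small self-contained sketch. The sample tweet is made up for illustration and is not taken from the project data. Because lowercasing happens after the substitutions, the `HASHTAG_` and `MENTION_` prefixes come out in lowercase, which is why the list above contains `mention_realdonaldtrump`.

```python
import re

def tweet_tokenizer(s):
    # same steps as the tokenizer above, copied so this cell runs on its own
    s = re.sub(r'#(\S+)', r'HASHTAG_\1', s)      # mark hashtags
    s = re.sub(r'@(\S+)', r'MENTION_\1', s)      # mark mentions
    s = re.sub(r'http\S+', 'THIS_IS_A_URL', s)   # collapse URLs
    # lowercase, replace runs of non-word characters with a space, split
    return re.sub(r'\W+', ' ', s.lower()).split()

# made-up example tweet (an assumption, not a row from the data)
sample = 'RT @user: check #news http://t.co/abc now!'
sample_tokens = tweet_tokenizer(sample)
print(sample_tokens)
```

Note that most of the 70 most common tokens are function words such as `the` and `to`, so frequency alone may not be the best way to pick features.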

In [28]:
def make_features(df):
    vec = DictVectorizer()
    feature_dicts = []
    # just as an initial example, we will consider three
    # word features in the model.
    words_to_track = ['think', 'today', 'should']
    for i, row in df.iterrows():
        features = {}
        token_counts = Counter(re.sub(r'\W+', ' ', row['text'].lower()).split())
        for w in words_to_track:
            features[w] = token_counts[w]
        feature_dicts.append(features)
    X = vec.fit_transform(feature_dicts)
    return X, vec
                
X, vec = make_features(df)
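
As one possible way to add non-word features, the sketch below computes a few simple metadata-style features from a tweet's text. The helper name `extra_features` and the feature names are hypothetical, and it assumes that `urlref` is the placeholder this dataset uses for URLs (it shows up among the most common tokens above).

```python
def extra_features(text):
    """Hypothetical helper: metadata-style features for one tweet's text."""
    tokens = text.lower().split()
    return {
        # 1 if the tweet starts with the retweet marker 'rt'
        'is_retweet': int(bool(tokens) and tokens[0] == 'rt'),
        # number of @mentions and #hashtags
        'n_mentions': sum(t.startswith('@') for t in tokens),
        'n_hashtags': sum(t.startswith('#') for t in tokens),
        # assumption: 'urlref' stands in for a URL in this dataset
        'has_url': int('urlref' in tokens),
        # tweet length in whitespace-separated tokens
        'n_tokens': len(tokens),
    }

# made-up example text in the same style as the data
example = extra_features('rt @jjmacnab : militia extremist urlref #news')
print(example)
```

Inside `make_features`, something like `features.update(extra_features(row['text']))` just before `feature_dicts.append(features)` would merge these with the word counts.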

In [6]:
# what are dimensions of the feature matrix?
X.shape

(1257, 3)

In [7]:
# what are the feature names?
# vocabulary_ is a dict from feature name to column index
vec.vocabulary_

{'think': 1, 'today': 2, 'should': 0}

In [32]:
# how often does each word occur?
for word, idx in vec.vocabulary_.items():
    print('%15s\t%d' % (word, X[:,idx].sum()))

          think	43
          today	41
         should	44


In [9]:
# can also get a simple list of feature names
# (newer scikit-learn versions rename this to get_feature_names_out()):
vec.get_feature_names()
# e.g., first column is 'should', second is 'think', etc.

['should', 'think', 'today']

In [10]:
# we'll first store the classes separately in a numpy array
y = np.array(df.label)
Counter(y)

Counter({-2.0: 650, -1.0: 176, 0.0: 201, 1.0: 89, 2.0: 141})

In [11]:
# to find the row indices with label 0.0
np.where(y==0.0)[0]

array([ 826,  827,  828,  829,  830,  831,  832,  833,  834,  835,  836,
        837,  838,  839,  840,  841,  842,  843,  844,  845,  846,  847,
        848,  849,  850,  851,  852,  853,  854,  855,  856,  857,  858,
        859,  860,  861,  862,  863,  864,  865,  866,  867,  868,  869,
        870,  871,  872,  873,  874,  875,  876,  877,  878,  879,  880,
        881,  882,  884,  885,  886,  887,  888,  889,  890,  891,  892,
        893,  894,  895,  896,  897,  898,  899,  900,  901,  902,  903,
        904,  905,  906,  907,  908,  909,  910,  911,  912,  913,  914,
        915,  916,  917,  918,  919,  920,  921,  922,  923,  924,  925,
        926,  927,  928,  929,  930,  931,  932,  933,  934,  935,  936,
        937,  938,  939,  940,  941,  942,  943,  944,  945,  946,  947,
        948,  949,  950,  951,  952,  953,  954,  955,  956,  957,  958,
        959,  960,  961,  962,  963,  964,  965,  966,  967,  968,  969,
        970,  971,  972,  973,  974,  975,  976,  9

In [12]:
# store the class names
class_names = set(df.label)

In [13]:
# how often does each word appear in each class?
for word, idx in vec.vocabulary_.items():
    for class_name in class_names:
        class_idx = np.where(y==class_name)[0]
        print('%20s\t%20s\t%d' % (word, class_name, X[class_idx, idx].sum()))

               think	                 0.0	6
               think	                 1.0	0
               think	                 2.0	7
               think	                -1.0	8
               think	                -2.0	22
               today	                 0.0	4
               today	                 1.0	2
               today	                 2.0	1
               today	                -1.0	8
               today	                -2.0	26
              should	                 0.0	7
              should	                 1.0	3
              should	                 2.0	6
              should	                -1.0	8
              should	                -2.0	20


So, all three words appear most often in the -2.0 class, but that is also by far the largest class (650 of 1257 rows), so the raw counts largely reflect class size. Note also that `think` never appears in the 1.0 class.
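
One way to make the counts comparable across classes is to divide each count by the number of rows in that class. The sketch below does this by hand, with the class sizes and word counts copied from the outputs above; the variable names are illustrative only.

```python
# class sizes and per-class word counts, copied from the outputs above
class_sizes = {-2.0: 650, -1.0: 176, 0.0: 201, 1.0: 89, 2.0: 141}
word_counts = {
    'think':  {-2.0: 22, -1.0: 8, 0.0: 6, 1.0: 0, 2.0: 7},
    'today':  {-2.0: 26, -1.0: 8, 0.0: 4, 1.0: 2, 2.0: 1},
    'should': {-2.0: 20, -1.0: 8, 0.0: 7, 1.0: 3, 2.0: 6},
}

# average number of occurrences per row, for each word and class
rates = {
    word: {label: n / class_sizes[label] for label, n in by_class.items()}
    for word, by_class in word_counts.items()
}

for word, by_class in rates.items():
    for label in sorted(by_class):
        print('%10s\t%5s\t%.3f' % (word, label, by_class[label]))
```

After this normalization, the -2.0 class no longer has the highest rate for any of the three words.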

In [14]:
# fit a LogisticRegression classifier.
clf = LogisticRegression(solver='lbfgs', multi_class='multinomial')
clf.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=1, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)
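
The lab also asks for measures of accuracy. A minimal sketch of one common approach, 5-fold cross-validation with `cross_val_score`, is below. It uses synthetic stand-in data (`X_demo`, `y_demo`, both assumptions) so that the cell runs on its own; in the lab you would pass the real `X` and `y` instead. The choice of `cv=5` and `max_iter=1000` is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in data (random counts and random labels), for illustration only
rng = np.random.RandomState(0)
X_demo = rng.poisson(0.5, size=(200, 3))
y_demo = rng.choice([-2.0, -1.0, 0.0, 1.0, 2.0], size=200)

clf_demo = LogisticRegression(solver='lbfgs', max_iter=1000)
# one accuracy score per held-out fold
scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5, scoring='accuracy')
print('fold accuracies:', scores)
print('mean accuracy: %.3f' % scores.mean())

# majority-class baseline for the real data: always predict -2.0 (650 of 1257 rows)
baseline = 650 / 1257
print('majority baseline: %.3f' % baseline)
```

Because more than half of the rows have label -2.0, a feature set is only helping if its cross-validated accuracy beats that majority baseline.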

In [15]:
# for binary classification, LogisticRegression stores a single coefficient vector;
# for a multi-class problem like this one, coef_ is a matrix with one row per class.
clf.coef_

array([[-0.16441574,  0.13729807,  0.45481963],
       [ 0.19063046,  0.38394791,  0.54849697],
       [-0.05914492,  0.01068498, -0.19520814],
       [-0.0681355 , -0.97844635, -0.08179174],
       [ 0.1010657 ,  0.4465154 , -0.72631671]])

In [16]:
# for binary classification, the coefficients for the negative class are just the negatives of those for the positive class;
# here there are five classes, so coef_ has one row of coefficients per class.
coef = clf.coef_
print(coef)

[[-0.16441574  0.13729807  0.45481963]
 [ 0.19063046  0.38394791  0.54849697]
 [-0.05914492  0.01068498 -0.19520814]
 [-0.0681355  -0.97844635 -0.08179174]
 [ 0.1010657   0.4465154  -0.72631671]]
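
The multi-class analogue of "negative class = minus positive class" seems to be that, for each feature, the coefficients sum to roughly zero across the classes. The sketch below checks this on the matrix printed above (values copied by hand); as far as I understand, this is what one would expect from an L2-penalized multinomial model, but treat it as an observation about this output rather than a guarantee.

```python
# coefficient matrix copied from the output above: one row per class,
# one column per feature (should, think, today)
coef_rows = [
    [-0.16441574,  0.13729807,  0.45481963],
    [ 0.19063046,  0.38394791,  0.54849697],
    [-0.05914492,  0.01068498, -0.19520814],
    [-0.0681355,  -0.97844635, -0.08179174],
    [ 0.1010657,   0.4465154,  -0.72631671],
]

# sum each column (feature) across the five classes
column_sums = [sum(col) for col in zip(*coef_rows)]
print(column_sums)
```

So a coefficient is best read relative to the other classes: a positive value for one class is balanced by negative values elsewhere.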


In [17]:
for ci, class_name in enumerate(clf.classes_):
    print('coefficients for %s' % class_name)
    display(pd.DataFrame([coef[ci]], columns=vec.get_feature_names()))

coefficients for -2.0


Unnamed: 0,should,think,today
0,-0.164416,0.137298,0.45482


coefficients for -1.0


Unnamed: 0,should,think,today
0,0.19063,0.383948,0.548497


coefficients for 0.0


Unnamed: 0,should,think,today
0,-0.059145,0.010685,-0.195208


coefficients for 1.0


Unnamed: 0,should,think,today
0,-0.068136,-0.978446,-0.081792


coefficients for 2.0


Unnamed: 0,should,think,today
0,0.101066,0.446515,-0.726317
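
To print the top coefficients for each class, one option is to sort each row of the coefficient matrix. The sketch below does this with the values copied from the tables above; `top_features` is a hypothetical helper, and with real data you would use `clf.coef_`, `clf.classes_`, and the vectorizer's feature names instead of the hand-copied lists.

```python
# values copied from the tables above
feature_names = ['should', 'think', 'today']
classes = [-2.0, -1.0, 0.0, 1.0, 2.0]
coef_rows = [
    [-0.16441574,  0.13729807,  0.45481963],
    [ 0.19063046,  0.38394791,  0.54849697],
    [-0.05914492,  0.01068498, -0.19520814],
    [-0.0681355,  -0.97844635, -0.08179174],
    [ 0.1010657,   0.4465154,  -0.72631671],
]

def top_features(row, names, n=2):
    """Hypothetical helper: the n features with the largest coefficients."""
    return sorted(zip(names, row), key=lambda pair: pair[1], reverse=True)[:n]

top_by_class = {c: top_features(row, feature_names)
                for c, row in zip(classes, coef_rows)}

for c, pairs in top_by_class.items():
    print('top features for %s:' % c)
    for name, value in pairs:
        print('%15s\t%.3f' % (name, value))
```

With only three features this is not very informative, but the same loop scales to a larger vocabulary by raising `n`.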
