This notebook loads three classifiers that label tweets hierarchically at three levels. First, tweets are sorted into behavior-related vs. not related. Second, they are classified as first person vs. not first person. Third, first-person, behavior-related tweets are labeled according to whether they show time-related activity in the past, present, or future.
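
As a rough illustration of the hierarchy described above, the sketch below cascades three classifiers so that each level only runs when the previous level returned its positive label. `KeywordClassifier` and `label_hierarchically` are hypothetical names introduced here for illustration; the keyword rules are toy stand-ins, not the pickled classifiers this notebook loads.

```python
class KeywordClassifier:
    """Hypothetical stand-in for a fitted classifier: it only mimics predict()."""

    def __init__(self, keywords, positive, negative):
        self.keywords = keywords
        self.positive = positive
        self.negative = negative

    def predict(self, texts):
        # Positive label if any keyword occurs in the lower-cased text
        return [
            self.positive if any(k in t.lower() for k in self.keywords) else self.negative
            for t in texts
        ]


def label_hierarchically(texts, clf_behavior, clf_person, clf_time,
                         behavior_pos="Tobacco", person_pos="1stPerson"):
    """Cascade three classifiers; a level stays None unless the previous one was positive."""
    rows = []
    for text in texts:
        row = {"text": text, "behavior": clf_behavior.predict([text])[0],
               "person": None, "time": None}
        if row["behavior"] == behavior_pos:
            row["person"] = clf_person.predict([text])[0]
            if row["person"] == person_pos:
                row["time"] = clf_time.predict([text])[0]
        rows.append(row)
    return rows


# Toy keyword rules, assumed only for this example
toy_behavior = KeywordClassifier(["smok", "vape", "cig"], "Tobacco", "NOT-Tobacco")
toy_person = KeywordClassifier(["i ", "i'm", "my "], "1stPerson", "NOT-1stPerson")
toy_time = KeywordClassifier(["now", "am ", "i'm"], "Present", "NOT-Present")

labeled = label_hierarchically(
    ["I'm smoking right now", "He vapes a lot", "Nice weather today"],
    toy_behavior, toy_person, toy_time,
)
for row in labeled:
    print(row)
```

Leaving the lower levels as `None` keeps "not classified at this level" distinct from a negative label.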

First, we load the requisite modules into the environment and change our base directory.

In [30]:
import os
import pickle
import sklearn
import pandas
import numpy

In [31]:
os.chdir('/Users/dethf/NYU/Spring 2019/Chunara Internship/nyu-twipsy/')

Next, we load the pickle files which contain our classifiers. 

In [32]:
clf_tob = pickle.load(open('/Users/dethf/NYU/Spring 2019/Chunara Internship/nyu-research/tobacco/tobacco-classifiers/classifier1_tobacco.p', 'rb'))
clf_fpt = pickle.load(open('/Users/dethf/NYU/Spring 2019/Chunara Internship/nyu-research/tobacco/tobacco-classifiers/classifier2_firstPerson.p', 'rb'))
clf_fpl = pickle.load(open('/Users/dethf/NYU/Spring 2019/Chunara Internship/nyu-research/tobacco/tobacco-classifiers/classifier3_present.p', 'rb'))
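
The loading calls above leave their file handles open. A minimal sketch of the same step with a context manager follows; `load_classifier` is a hypothetical helper name, and the round trip uses a throwaway object rather than the real classifier files. Unpickling can execute arbitrary code, so this should only be used on trusted files.

```python
import os
import pickle
import tempfile


def load_classifier(path):
    """Load one pickled object and close the file handle afterwards."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round-trip demonstration with a throwaway object (not a real classifier)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.p")
    with open(demo_path, "wb") as handle:
        pickle.dump({"name": "demo"}, handle)
    restored = load_classifier(demo_path)

print(restored)
```

A pickled scikit-learn model may fail to load, or behave differently, under a scikit-learn version other than the one that saved it.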

In [33]:
# set the threshold: 99% of data is not tobacco related
# JM Comment: this did not seem to be needed with tobacco, will try without this step to see how it goes...
#clf_tob.steps[2][1].class_weight = {0:0.99, 1:0.01}

Here we use some pre-classified tweets to regenerate the tobacco-related vs. not tobacco-related, first-person vs. not first-person, and first-person-level labels. Although these data were used to train the classifiers, they were hand-annotated by MTurk workers, so we should not expect the predictions to match the annotations exactly.

There are multiple training datasets for each level of classification. We will load them all below. Then, we will see how many of each label there are. 

In [34]:
df_comb = pandas.read_csv("C:/Users/dethf/NYU/Spring 2019/Chunara Internship/nyu-research/tobacco/training-data/combined.csv")
df_tob = pandas.read_csv("C:/Users/dethf/NYU/Spring 2019/Chunara Internship/nyu-research/tobacco/training-data/tob.csv")
df_1p = pandas.read_csv("C:/Users/dethf/NYU/Spring 2019/Chunara Internship/nyu-research/tobacco/training-data/fp.csv")
df_pres = pandas.read_csv("C:/Users/dethf/NYU/Spring 2019/Chunara Internship/nyu-research/tobacco/training-data/present.csv")
df_curr = pandas.read_csv("C:/Users/dethf/NYU/Spring 2019/Chunara Internship/nyu-research/tobacco/training-data/current.csv")

In [9]:
# printing number of tobacco vs. not tobacco tweets labeled
print("NOT-Tobacco:", numpy.unique(df_tob.labels, return_counts=True)[1][0], 
      "Tobacco:", numpy.unique(df_tob.labels, return_counts=True)[1][1])

NOT-Tobacco: 22492 Tobacco: 3676
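
`numpy.unique` returns the labels in sorted order, so picking counts by position only works when the printed names follow that order. A sketch that keeps each count attached to its label is shown below; `label_counts` is a hypothetical helper and the labels are toy values.

```python
from collections import Counter


def label_counts(labels):
    """Return a {label: count} mapping so counts cannot be paired with the wrong name."""
    return dict(Counter(labels))


toy_labels = ["1stPerson", "NOT-1stPerson", "NOT-1stPerson"]

# Sorted order puts the label with a leading digit first
print(sorted(set(toy_labels)))
print(label_counts(toy_labels))
```

With a pandas Series such as `df_tob.labels`, `value_counts()` gives the same kind of label-indexed result.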


We see that 22492 tweets are labeled not tobacco related and 3676 are labeled tobacco related. Next we run the classifier on this prelabeled dataset to see how many of the original labels we can reproduce.

In [10]:
# running the classifier
tob_preds = clf_tob.predict(df_tob)

In [11]:
# finding out how many tweets there were in each group
print("NOT-Tobacco:", numpy.unique(tob_preds, return_counts=True)[1][0],
      "Tobacco:",numpy.unique(tob_preds, return_counts=True)[1][1])

NOT-Tobacco: 22492 Tobacco: 3676


The counts per group match, but matching counts do not guarantee matching labels, so we now check for mislabeled tweets and inspect the predicted probabilities for each label.

In [12]:
df_tob[tob_preds != df_tob.labels]

Unnamed: 0,labels,text
5161,NOT-Tobacco,Do u even vape
24820,Tobacco,Who tryna smoke


In [13]:
tob_probs = clf_tob.predict_proba(df_tob)

In [14]:
tob_probs[numpy.where(tob_preds != df_tob.labels)[0]]

array([[0.48639826, 0.51360174],
       [0.51083009, 0.48916991]])

We can see from the prediction probabilities that it was nearly a toss-up as to the label those tweets got. Those tweets were clearly difficult to categorize.

Only two mislabeled tweets out of 26,168 is remarkably accurate, though these are the same tweets the classifier was trained on.
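
The "toss-up" reading can be made explicit by computing, for each wrong prediction, the gap between the two largest class probabilities. This is a sketch only: `misclassified_with_margin` is a hypothetical helper, the labels are toy values, and the first two probability rows are the ones printed above.

```python
def misclassified_with_margin(true_labels, predicted_labels, probabilities):
    """List (row index, true, predicted, margin) for each wrong prediction.

    margin is the gap between the two largest class probabilities; a small
    margin means the classifier was close to undecided.
    """
    report = []
    for i, (truth, pred, probs) in enumerate(
            zip(true_labels, predicted_labels, probabilities)):
        if truth != pred:
            top, runner_up = sorted(probs, reverse=True)[:2]
            report.append((i, truth, pred, round(top - runner_up, 4)))
    return report


toy_true = ["NOT-Tobacco", "Tobacco", "Tobacco"]
toy_pred = ["Tobacco", "NOT-Tobacco", "Tobacco"]
toy_probs = [[0.48639826, 0.51360174], [0.51083009, 0.48916991], [0.02, 0.98]]

margin_report = misclassified_with_margin(toy_true, toy_pred, toy_probs)
for entry in margin_report:
    print(entry)
```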

Next we will run the first person tobacco classifier on the tweets used for its training to see how many of the original labels we can reproduce.

In [15]:
# running classifier
fp_preds = clf_fpt.predict(df_1p)

In [16]:
# printing number of first-person vs. not first person tweets annotated (by hand)
# numpy.unique returns labels in sorted order; '1stPerson' (leading digit) sorts first
print("1stPerson:", numpy.unique(df_1p.labels, return_counts=True)[1][0], 
      "NOT-1stPerson:", numpy.unique(df_1p.labels, return_counts=True)[1][1])

1stPerson: 1628 NOT-1stPerson: 2048


In [17]:
# printing number of first person vs. not first person tweets the classifier predicted
print("1stPerson:", numpy.unique(fp_preds, return_counts=True)[1][0], 
      "NOT-1stPerson:", numpy.unique(fp_preds, return_counts=True)[1][1])

1stPerson: 1626 NOT-1stPerson: 2050


We don't generate the same label distribution. We check next to see what tweets were mislabeled and their prediction probabilities as we did above:

In [18]:
df_1p[fp_preds != df_1p.labels]

Unnamed: 0,labels,text
1399,1stPerson,I hate cigarettes
2176,1stPerson,look at this cool vape trick


In [19]:
fp_probs = clf_fpt.predict_proba(df_1p)
fp_probs[numpy.where(fp_preds != df_1p.labels)[0]]

array([[0.48797202, 0.51202798],
       [0.48560346, 0.51439654]])

As before, the two mislabeled tweets seem clearly difficult to categorize as either first person or not. 

We now continue with the next level of classification: present vs. not present.

In [20]:
# running the classifier
pres_preds = clf_fpl.predict(df_pres)

Now we check to see if our classification reproduced the original annotations:

In [21]:
# printing number of present vs. not present tweets annotated (by hand)
print("NOT-Present:", numpy.unique(df_pres.labels, return_counts=True)[1][0], 
      "Present:", numpy.unique(df_pres.labels, return_counts=True)[1][1])

NOT-Present: 662 Present: 966


In [22]:
# printing number of present vs. not present tweets classified
print("NOT-Present:", numpy.unique(pres_preds, return_counts=True)[1][0], 
      "Present:", numpy.unique(pres_preds, return_counts=True)[1][1])

NOT-Present: 660 Present: 968


Again, we do not generate the same label distribution. We check next to see what tweets were misclassified and what their prediction probabilities were:

In [23]:
df_pres[pres_preds != df_pres.labels]

Unnamed: 0,labels,text
455,NOT-Present,Hookah bar
540,NOT-Present,I got a project in mind so wine glass and ciga...


In [24]:
pres_probs = clf_fpl.predict_proba(df_pres)
pres_probs[numpy.where(pres_preds != df_pres.labels)[0]]

array([[0.19836836, 0.80163164],
       [0.49771675, 0.50228325]])

Again, remarkably, the classifier makes only two mistakes; both tweets were annotated NOT-Present but predicted as Present. Only the second tweet was a close call, with a probability of about 0.50 for Present. The first tweet, "Hookah bar," was assigned Present with a probability of about 0.80; its short length may have contributed to this.

In [25]:
from sklearn.metrics import confusion_matrix
# rows are the annotated labels, columns are the predicted labels
conf_mat_pres = confusion_matrix(df_pres.labels, pres_preds)
print(conf_mat_pres)
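
scikit-learn's `confusion_matrix` lives in `sklearn.metrics` and expects the true labels first, then the predictions. As a dependency-free sketch of the same cross-tabulation, the pairs can be counted directly; `confusion_counts` is a hypothetical helper and the labels below are toy values.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (annotated label, predicted label) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


toy_true = ["NOT-Present", "NOT-Present", "Present", "Present"]
toy_pred = ["NOT-Present", "Present", "Present", "Present"]

counts = confusion_counts(toy_true, toy_pred)
for (truth, pred), n in sorted(counts.items()):
    print(f"annotated {truth:<12} predicted {pred:<12} {n}")
```

Pairs that never occur simply return a count of zero.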

We have finished running our classifiers, but the "Current" training data remain; these are the same tweets used to train the first-person-level classifier. Since the existing classifiers do not return any 'current' labels, we cannot test their accuracy against these annotations. However, we know from the Tom Huang paper that 'present' and 'habitual' posts were combined. It may be that these combined tweets were then given the label 'Current.' We will attempt to figure that out.

In [26]:
# running the classifier
curr_preds = clf_fpl.predict(df_curr)

In [27]:
numpy.unique(df_curr.labels, return_counts = True)

(array(['current', 'not-current'], dtype=object),
 array([1398,  230], dtype=int64))

In [28]:
numpy.unique(df_pres.labels, return_counts = True)

(array(['NOT-Present', 'Present'], dtype=object),
 array([662, 966], dtype=int64))

We see there are 1398 tweets labeled current and 230 not current. We also see again that there are 966 tweets labeled Present, and 662 tweets labeled NOT-Present. If all the 'Present' tweets also share the label 'Current,' they must be a subset of the entire 'Current' tweets. We will check this now:

In [29]:
sum(df_curr.labels[numpy.where(df_pres.labels == "Present")[0]] == "current")

966

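As we suspected, all 966 'Present' tweets are also labeled 'current,' so the 'Current' label appears to combine the 'Present' label with another one, probably 'Habitual,' which would account for the remaining 432 'current' tweets.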
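
The subset check relies on both files listing the same tweets in the same order. A sketch that makes that assumption explicit and also reports how many rows carry only the broader label is given below; `labels_nested` is a hypothetical helper and the lists are toy values.

```python
def labels_nested(inner_labels, outer_labels, inner_value, outer_value):
    """Check that every row carrying inner_value also carries outer_value.

    Assumes both lists describe the same tweets in the same order.
    Returns (is_subset, overlap, rows_with_outer_value_only).
    """
    if len(inner_labels) != len(outer_labels):
        raise ValueError("label lists must have the same length")
    inner_rows = [i for i, lab in enumerate(inner_labels) if lab == inner_value]
    overlap = sum(outer_labels[i] == outer_value for i in inner_rows)
    outer_only = sum(lab == outer_value for lab in outer_labels) - overlap
    return overlap == len(inner_rows), overlap, outer_only


toy_present = ["Present", "NOT-Present", "Present", "NOT-Present"]
toy_current = ["current", "current", "current", "not-current"]

nested_result = labels_nested(toy_present, toy_current, "Present", "current")
print(nested_result)
```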

The cell below counts the labels in the final 'Combined' dataset, which represents the combined labels at each level of classification: Tobacco, First Person, and Current. The 1398 tweets labeled current above should appear there.

In [None]:
numpy.unique(df_comb.labels, return_counts=True)