# Assignment #3: Exploratory Analysis by Linear Support Vector Classifier, By Luis Macias and Yitz Deng

For this assignment, we decided to explore supervised machine learning, specifically a Support Vector Machine trained on words. We had a massive data set of 3.5 million rows of New York Times headlines from 1980 to 2016, collected using the NYTimes archive API. We wondered whether, with this much data, it would be possible to predict the year of publication solely from the headline of an article. The intuition is that important events that shape a year or era will usually be important enough to be reported in a newspaper headline. For example, we don’t expect headlines about the Soviet Union’s fall to appear in the 2000s and later. The cells below show our undertaking and commentary on this exploratory analysis. We discuss our findings at the very end.

In [1]:
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score # to assess the accuracy of the algorithm
import numpy as np
from sklearn.svm import LinearSVC # Linear Support Vector Classifier

3.5 million rows of data are a double-edged sword. When we loaded them into Jupyter Notebook, there were a slew of NAs and empty fields in the rows of our data. This occurred because, for older headlines, the archive API does not include information such as which section an article was published in, while later rows (the $2000$s and later) did include whether an article was published in the arts section, the business section, etc. Thus, in loading the data we converted these empty fields into $NA$s, as they would be easier to filter out later.

In [3]:
df = pandas.read_csv("allData.csv", header = 0, na_values = '')


Another issue that cropped up in our exploratory analysis was the labeling of our data. The years 2015 and 2016 were not as rich as the years before them, so we omitted those years so as not to distort our Linear Support Vector Classifier. Another issue was that keeping every year from $1980$ to $2014$ would make training and prediction much slower, as there would be 35 potential categories ($1980, 1981, \ldots, 2014$) for a headline to be classified into. Thus, we decided to group our year labels into periods of 5 years, for a final total of 7 categories. While still not as fast as a simple binary classifier, our Linear Support Vector Classifier finishes training and classifying much faster than it would with the original 35 categories.

In [4]:
def getYear(date):
    year = int(date[2:4])
    if(80 <= year <= 84):
        return 0
    if(85 <= year <= 89):
        return 1
    if(90 <= year <= 94):
        return 2
    if(95 <= year <= 99):
        return 3
    if(0 <= year <= 4):
        return 4
    if(5 <= year <= 9):
        return 5
    if(10 <= year <= 14):
        return 6
    return -1


df['Year'] = df['Date'].map(getYear)
df = df[df['Year'] != -1] #filtering out 2015, 2016 
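
The same five-year grouping can also be written arithmetically. The following is only a sketch: the helper `year_bucket` and its parameters are hypothetical (not part of our notebook), and it assumes that the date string begins with a four-digit year, which is consistent with `getYear` reading the last two digits of the year from `date[2:4]`.

```python
def year_bucket(date, first_year=1980, last_year=2014, width=5):
    """Map a date string starting with a four-digit year to a bucket index.

    Hypothetical helper: assumes the first four characters are the year.
    """
    year = int(date[:4])
    if not first_year <= year <= last_year:
        return -1  # outside the range we keep (e.g. 2015 and 2016)
    # 1980-1984 -> 0, 1985-1989 -> 1, ..., 2010-2014 -> 6
    return (year - first_year) // width


print(year_bucket("1991-12-26"))
print(year_bucket("2016-01-01"))
```

With the default parameters this yields the same seven labels as `getYear`, and changing `width` would make it easy to try other groupings.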

In [5]:
df2 = df.sample(frac = 1) # shuffle the rows; no random_state is set, so the order changes on every run

We tried a trial training and testing data set of 15000 rows and quickly figured out that processing all of our data would take a significant amount of time. Our CPU and RAM usage shot through the ceiling, prompting us to move our training and prediction onto a remote server.

In [6]:
training = df2[:10000][['Headline', 'Year']].dropna()
test = df2[10000:15000][['Headline', 'Year']].dropna()

In [7]:
training # showcasing the training data set to show that it was sampled from our original data

Unnamed: 0,Headline,Year
1198354,Sarah Eusden Is Engaged to Charles A. Gallop,2
1855815,"Paid Notice: Deaths ABRAM, MORRIS B.",4
1912635,"Under Circumstances, No Pomp as Clinton Signs ...",4
1172121,They Came to California for the Good Life; Now...,2
584575,NIKE INC reports earnings for Qtr to Aug 31,1
543095,CHERRY ELECTRICAL PRODCTS CORP reports earning...,1
3066646,Returning Upriver With Very Few Fish,6
3337175,Boys Don’t Run Away From These Princesses,6
2551345,Walk Tightropes. Teach Yoga. Fight Terrorists.,5
25175,Around the Nation; 29 Hurt and 300 Evacuated I...,0


Still working with our small set of $15000$ rows, we turned the headlines into a document-term matrix. We also required that a word appear in at least 3 headlines for it to be included in the matrix (`min_df = 3`). Even with this filter, training still took a significant amount of time, but the filter would help significantly once we moved to a remote server.

In [25]:
#transform the 'Headline' column into a document term matrix
#tfidfvec = TfidfVectorizer(stop_words = 'english', min_df = 3, binary=True)
countvec = CountVectorizer(stop_words = 'english', min_df = 3, binary=True)

training_dtm_tf = countvec.fit_transform(training.Headline)
test_dtm_tf = countvec.transform(test.Headline)

#create an array for labels
training_labels = training.Year
test_labels = test.Year
test_labels.value_counts()
test_labels # just sanity checking that all 7 labels were present in our data set

249465     0
2051700    4
448360     0
2060099    4
3393865    6
1743627    3
778760     1
3337332    6
2629840    5
1048524    2
1179646    2
246328     0
3112725    6
2233928    4
1224544    2
2571903    5
293865     0
1111658    2
24852      0
2701309    5
2997016    6
1815987    3
2266706    4
83011      0
1273438    2
1080934    2
1465416    3
2974819    6
184140     0
1868746    4
          ..
1974150    4
1545803    3
1817365    3
1863993    4
1686936    3
1398971    2
1755145    3
1612668    3
763487     1
52848      0
2037162    4
245926     0
1895878    4
465725     0
3140220    6
487252     0
1964459    4
3225455    6
1620100    3
1300388    2
2740157    5
2498187    5
2460510    5
3103861    6
278804     0
1919171    4
345081     0
1168791    2
25896      0
3378234    6
Name: Year, dtype: int64

In [9]:
training_dtm_tf # seeing the dimensions of our data set 

<10000x3551 sparse matrix of type '<class 'numpy.int64'>'
	with 34788 stored elements in Compressed Sparse Row format>
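
To illustrate what the document-frequency filter and the binary setting do, here is a plain-Python sketch on a made-up corpus of four headlines. It is not scikit-learn's implementation: tokenization is simplified to whitespace splitting, stop-word removal is omitted, and the headlines are invented for the example.

```python
from collections import Counter

# Made-up headlines, already lowercased (not taken from our data set).
headlines = [
    "markets rally as rates fall",
    "rates rise again",
    "senate debates rates",
    "markets slide",
]

# Document frequency: in how many headlines does each word occur?
doc_freq = Counter(word for h in headlines for word in set(h.split()))

# Keep only the words that occur in at least min_df headlines.
min_df = 3
vocabulary = sorted(w for w, n in doc_freq.items() if n >= min_df)

# Binary document-term matrix: 1 if the word is present in the headline, else 0.
dtm = [[int(w in h.split()) for w in vocabulary] for h in headlines]

print(vocabulary)
print(dtm)
```

Only `rates` survives the filter here, because it is the only word found in three of the four headlines. On real data the same filter is what keeps the number of columns manageable.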

### Results of the 15000 rows of headlines.
The initial results showed an accuracy of about 24%, which is better than picking uniformly at random among the 7 categories (about 14%). This prompted us to scale up our machine learning and use all 3.5 million rows of data.

In [10]:
#initial results of our small sample of the data
#svc = LinearSVC()
#svc.fit(training_dtm_tf, training_labels)
#predictions_svc = svc.predict(test_dtm_tf) 
#accuracy_score(predictions_svc, test_labels)

0.2432
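
To put this accuracy in context, it helps to compare it against simple baselines. The sketch below computes the expected accuracy of guessing uniformly at random, and the accuracy of always predicting the most common label. The functions and the `toy_labels` list are hypothetical illustrations, not our actual label distribution.

```python
from collections import Counter


def chance_accuracy(n_classes):
    """Expected accuracy when guessing uniformly among n_classes labels."""
    return 1.0 / n_classes


def majority_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Hypothetical labels for illustration only.
toy_labels = [0, 0, 0, 1, 2, 3, 4, 5, 6, 6]

print(chance_accuracy(7))
print(majority_accuracy(toy_labels))
```

If the seven periods are not equally represented in the data, the majority-label baseline is the stricter of the two comparisons.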

## Final results of the 3.5 Million rows of headlines.

The following are the results of training on 2.5 million headlines and testing the resulting model.
We also included the results of training on 2 million, 1.5 million, 1 million, and so on, down to 50,000 headlines, to see the trade-off between the amount of training data and accuracy.

In [10]:
#final results
# print the accuracy for each training-set size, largest first
for n in [2500000, 2000000, 1500000, 1000000, 500000, 100000, 50000]:
    with open("results%d.txt" % n) as f:
        predictions = f.read().split()
    with open("correct%d.csv" % n) as f:
        correct = f.read().split()
    print(accuracy_score(predictions, correct))


0.386579770539
0.384503338701
0.379538488446
0.371182508347
0.357177206673
0.313646754123
0.291500530357
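
To make the trade-off between training-set size and accuracy easier to read, the sketch below takes the seven accuracies printed above, pairs each with its training-set size (in the order the `print` calls were made), and computes the gain in percentage points from one size to the next. The variable names are ours for this illustration only.

```python
# (training-set size, accuracy) pairs copied from the output above,
# ordered from the smallest to the largest training set.
results = [
    (50000, 0.291500530357),
    (100000, 0.313646754123),
    (500000, 0.357177206673),
    (1000000, 0.371182508347),
    (1500000, 0.379538488446),
    (2000000, 0.384503338701),
    (2500000, 0.386579770539),
]

# Gain in percentage points from each training-set size to the next one.
gains = [
    (n, round((acc - prev_acc) * 100, 2))
    for (_, prev_acc), (n, acc) in zip(results, results[1:])
]

for n, gain in gains:
    print(f"up to {n:>9,} headlines: +{gain:.2f} points")
```

Note that the steps are not equally sized, so the gains are not directly comparable. Still, each of the last three steps adds 500,000 headlines, and their gains shrink steadily, which suggests diminishing returns from more training data.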


### Conclusion

Our best SVM model, trained on $2500000$ article headlines, was able to correctly classify the approximate year (the five-year period) of a headline about $38.658\%$ of the time. This is significantly better than randomly selecting one of the seven possible options, $p_c = \frac{1}{7}\approx 0.14$.
Even with 3.5 million headlines available, an accuracy of about 38% shows that our approach of predicting a period of years from a NY Times article headline is flawed and probably not a good one. Even with further tuning of the bias-variance tradeoff, this model would not be an accurate predictor of years.

Our expectations of the results were mixed. While we certainly did not think the headlines would be a good predictor of years, we were surprised that the accuracy was greater than picking at random. The results make sense to us. While there are indeed newsworthy events that date the time of publication (our fall of the Soviet Union example still holds), there are many more events that do not date a publication. For example, news of shootings and stabbings is unfortunately so common that it ended up being featured monthly regardless of the year. Other newsworthy events span more than one period of time and were not encoded with distinctions. For example, talk of a Clinton in politics could be found in the late 1990s and around the 2008 Democratic primary, but headlines don’t often specify whether it’s Hillary Clinton or Bill Clinton.

As for new hypotheses generated by this assignment, we wondered what would happen if, instead of a supervised model, an unsupervised model were applied to this data set. What types of clusters would be generated? This would be especially interesting for old headlines: we would see where they are clustered and whether the clusters resemble the current sections of the NY Times (arts, business, etc.). Additionally, if unsupervised clustering were applied to all of our data, would the clusters match the sections of the NY Times, or would they show that there are actually more clusters, and therefore more ways to classify articles, than the current sections?

Overall, this was an insightful look into text analysis using the skills we learned in this course, and we are eager to put these skills to use in our final project.