For this challenge, you will need to choose a corpus of data from nltk or another source that includes categories you can predict and create an analysis pipeline that includes the following steps:

1. Data cleaning / processing / language parsing
2. Create features using two different NLP methods: For example, BoW vs tf-idf.
3. Use the features to fit supervised learning models for each feature set to predict the category outcomes.
4. Assess your models using cross-validation and determine whether one model performed better.
5. Pick one of the models and try to increase accuracy by at least 5 percentage points.
6. Write up your report in a Jupyter notebook. Be sure to explicitly justify the choices you make throughout, and submit it below.



### 1. Data loading/cleaning/processing/language parsing

In [59]:
import pandas as pd
import numpy as np
import json

import time
import itertools
import spacy
import re
import string

from nltk.corpus import stopwords
from collections import Counter

import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.cm as cm
%matplotlib inline

### Data cleaning / processing / language parsing

In [60]:
# create list of opinion writers at nytimes
writers = ['Charles M. Blow','David Brooks','Frank Bruni','Roger Cohen','Gail Collins',
          'Ross Douthat','Maureen Dowd','Thomas L. Friedman','Michelle Goldberg','Nicholas Kristof',
          'Paul Krugman','David Leonhardt','Andrew Rosenthal','Bret Stephens']
writers = [x.upper() for x in writers]
print(writers)

['CHARLES M. BLOW', 'DAVID BROOKS', 'FRANK BRUNI', 'ROGER COHEN', 'GAIL COLLINS', 'ROSS DOUTHAT', 'MAUREEN DOWD', 'THOMAS L. FRIEDMAN', 'MICHELLE GOLDBERG', 'NICHOLAS KRISTOF', 'PAUL KRUGMAN', 'DAVID LEONHARDT', 'ANDREW ROSENTHAL', 'BRET STEPHENS']


In [61]:
#import data from json - full text of op ed articles from ny times
op_ed_articles = pd.read_json('nytimes_oped_articles.json')
#print(op_ed_articles['full_text'][2])
op_ed_articles = op_ed_articles.reset_index(drop=True)
op_ed_articles.head()

Unnamed: 0,byline,date,full_text,subjects,word_count
0,ANDREW ROSENTHAL,2017-10-19,When most Americans think of domestic terroris...,"['Blacks', 'Police Brutality, Misconduct and S...",753
1,ANDREW ROSENTHAL,2017-10-04,It’s time to talk about taking away guns — not...,"['Gun Control', 'Firearms', 'Las Vegas, Nev, S...",776
2,ANDREW ROSENTHAL,2017-05-18,"In normal times, the appointment of a former F...",['Special Prosecutors (Independent Counsel)'],686
3,DAVID LEONHARDT,2017-11-29,This article is part of the Opinion Today news...,"['United States Politics and Government', 'Tax...",528
4,DAVID LEONHARDT,2017-11-02,This article is part of the Opinion Today news...,"['Taxation', 'Baseball', 'Officiating (Sports)...",725


In [62]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    # Visual inspection identified punctuation that spaCy does not handle,
    # so it is removed here before tokenizing.
    # remove parentheses
    text = re.sub(r"[()]", "", text)
    # remove non-ASCII characters (this also strips curly quotes and apostrophes)
    text = re.sub(r'[^\x00-\x7F]', '', text)
    # remove single character words (disabled)
    #text = re.sub(r"\b[a-zA-Z]\b", "", text)
    # replace double hyphens with a single hyphen
    text = re.sub(r'--', '-', text)
    # remove square brackets and anything between them
    text = re.sub(r"\[.*?\]", "", text)
    # collapse runs of whitespace into single spaces
    text = ' '.join(text.split())
    return text
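
To see what the cleaner does to a single string, here is a minimal standalone sketch. It mirrors the cleaning steps of the cell above so it can run on its own; the sample sentence is made up for illustration and is not taken from the corpus. Note how removing non-ASCII characters also deletes curly apostrophes, which would explain tokens such as "its" and "do nt" in the outputs further down.

```python
import re

def text_cleaner(text):
    """Mirror of the notebook's cleaning steps, repeated so this runs alone."""
    text = re.sub(r"[()]", "", text)            # drop parentheses
    text = re.sub(r"[^\x00-\x7F]", "", text)    # drop non-ASCII characters
    text = re.sub(r"--", "-", text)             # double hyphen -> single hyphen
    text = re.sub(r"\[.*?\]", "", text)         # drop bracketed asides
    return " ".join(text.split())               # collapse whitespace

# Hypothetical sample sentence with curly apostrophes (U+2019).
sample = "It\u2019s time (finally) to talk [sic] -- don\u2019t wait."
cleaned = text_cleaner(sample.lower())
print(cleaned)
```

If the contractions matter for authorship style, one option would be to map curly quotes to their ASCII equivalents before the non-ASCII step rather than dropping them.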

In [63]:
# clean up the text: lowercase each article, then apply the cleaner
op_ed_articles['full_text'] = op_ed_articles['full_text'].str.lower().apply(text_cleaner)

In [64]:
# tokenize the text
# ('en' is the spaCy 2.x shortcut; spaCy 3 requires the full model name, e.g. 'en_core_web_sm')
nlp = spacy.load('en')
op_ed_articles['full_text_tokenized'] = op_ed_articles['full_text'].apply(nlp)


### Create features using two different NLP methods: For example, BoW vs tf-idf.

#### Bag of Words:

In [8]:
# create train/test dataset for full texts
y = op_ed_articles['byline']
X = op_ed_articles.drop(['byline'],axis=1)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,stratify=y, test_size=0.25)


In [9]:
# Combine the tokenized articles with their bylines, one row per article
# (note: this uses all articles in X and y, not only the training split)
combine_Xtrain_ytrain = pd.concat([X,y],axis=1)
combine_Xtrain_ytrain = combine_Xtrain_ytrain.drop(['date','subjects','word_count','full_text'],axis=1)
combine_Xtrain_ytrain.head()

Unnamed: 0,full_text_tokenized,byline
0,"(when, most, americans, think, of, domestic, t...",ANDREW ROSENTHAL
1,"(its, time, to, talk, about, taking, away, gun...",ANDREW ROSENTHAL
2,"(in, normal, times, ,, the, appointment, of, a...",ANDREW ROSENTHAL
3,"(this, article, is, part, of, the, opinion, to...",DAVID LEONHARDT
4,"(this, article, is, part, of, the, opinion, to...",DAVID LEONHARDT


In [12]:
# create a dataframe of sentences for bag of words
opinions_sentences_training_set = pd.DataFrame(columns=['full_text_sentences','author'])
for writer in writers:
    for index, row in combine_Xtrain_ytrain.iterrows():
        if row['byline'] == writer:
            df_temp = pd.DataFrame()
            sentences = []
            author = writer
            sentences = [sent for sent in row['full_text_tokenized'].sents]
            df_temp = pd.DataFrame({'full_text_sentences': sentences,'author': author})
            opinions_sentences_training_set = pd.concat([opinions_sentences_training_set,df_temp])
opinions_sentences_training_set = opinions_sentences_training_set.reset_index(drop=True)
opinions_sentences_training_set.head(n=122)

Unnamed: 0,author,full_text_sentences
0,CHARLES M. BLOW,"(steve, bannon, may, no, longer, be, physicall..."
1,CHARLES M. BLOW,"(leadership, ,, breitbart, has, given, favorab..."
2,CHARLES M. BLOW,"(and, spencer, loves, it.yes, ,, that, richard..."
3,CHARLES M. BLOW,"(that, was, the, same, protest, about, which, ..."
4,CHARLES M. BLOW,"(i, do, nt, think, it, has, done, this, delibe..."
5,CHARLES M. BLOW,"(12, ,, the, wall, street, journal, reported, ..."
6,CHARLES M. BLOW,"(trump, still, frequently, consults, him, ,, a..."
7,CHARLES M. BLOW,"(if, alabama, voters, on, tuesday, elect, roy,..."
8,CHARLES M. BLOW,"(he, is, them, ,, and, they, are, him, .)"
9,CHARLES M. BLOW,"(any, pretense, of, tolerance, and, egalitaria..."


In [13]:
# create a merged text for each writer: join all of a writer's articles into one string
combine_Xtext_train_ytrain = pd.concat([X,y],axis=1)
combine_Xtext_train_ytrain = combine_Xtext_train_ytrain.drop(['date','subjects','word_count','full_text_tokenized'],axis=1)
combine_Xtext_train_ytrain= combine_Xtext_train_ytrain.sort_values(['byline'])
# print(combine_Xtext_train_ytrain)

combined_rows = []
for writer in writers:
    is_writer = combine_Xtext_train_ytrain['byline'] == writer
    writer_text = " ".join(combine_Xtext_train_ytrain.loc[is_writer, 'full_text'])
    combined_rows.append({'author': writer, 'combined_text': writer_text})
df_writer_combined_text = pd.DataFrame(combined_rows, columns=['author', 'combined_text'])
df_writer_combined_text.head(n=20)

Unnamed: 0,author,combined_text
0,CHARLES M. BLOW,i know that harvey is heavy on americas heart....
1,DAVID BROOKS,were living in the middle of a national crisis...
2,FRANK BRUNI,"across the country, college freshmen are settl..."
3,ROGER COHEN,"yangon, myanmar president trump is incidental ..."
4,GAIL COLLINS,donald trump has just voted in the worst cabin...
5,ROSS DOUTHAT,back when congressional republicans were rolli...
6,MAUREEN DOWD,"washington so, with this latest toad jumping f..."
7,THOMAS L. FRIEDMAN,i was talking the other day to a wise executiv...
8,MICHELLE GOLDBERG,"on a friday night last month, i moderated a de..."
9,NICHOLAS KRISTOF,"for decades, one of the most sanctimonious mor..."


In [14]:
# tokenize the text

df_writer_combined_text['combined_text_tokenized'] = df_writer_combined_text['combined_text'].apply(lambda x: nlp(x))

In [15]:
# Utility function to create a list of the 1000 most common words. 

def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(1000)]

# Creates a data frame with features for each word in our common word set.
# Each value is the count of the times the word appears in each sentence.
# Note: the `sentences` argument is not used; the rows are read from the
# global opinions_sentences_training_set.
def bow_features(sentences, words_in_articles):
    
    # pandas needs an ordered collection for column labels, so convert the set.
    vocab = list(words_in_articles)
    
    # Scaffold the data frame (training data/target dataset) and initialize counts to zero.
    df = pd.DataFrame(columns=vocab)
    df['text_sentence'] = opinions_sentences_training_set['full_text_sentences']
    df['text_source'] = opinions_sentences_training_set['author']
    df.loc[:, vocab] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     token.lemma_ in words_in_articles
                     and not token.is_punct
                     and not token.is_stop
                 )]
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
 
        # This counter is just to make sure the kernel didn't hang.
        if i % 500 == 0:
            print("Processing row {}".format(i))
            
    return df
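
The per-cell `df.loc[i, word] += 1` updates are the slow part of `bow_features`. As a rough sketch of an alternative, the counts for each sentence can be collected with `collections.Counter` and turned into rows in one pass. The token lists and vocabulary below are hypothetical stand-ins for the lemmatised spaCy sentences, and `count_row` is an illustrative helper, not part of the notebook.

```python
from collections import Counter

def count_row(tokens, vocabulary):
    """Return {word: count} for the tokens that belong to the vocabulary."""
    return Counter(tok for tok in tokens if tok in vocabulary)

# Hypothetical, already-lemmatised sentences standing in for spaCy spans.
sentences_demo = [
    ["tax", "cut", "help", "tax", "payer"],
    ["gun", "law", "fail"],
]
vocabulary = {"tax", "gun", "law", "cut"}
columns = sorted(vocabulary)   # fixed column order, since sets are unordered

rows = [count_row(sent, vocabulary) for sent in sentences_demo]
# Counter returns 0 for missing words, so every row has every column.
matrix = [[row[word] for word in columns] for row in rows]
print(columns)
print(matrix)
```

The resulting `matrix` and `columns` could then be handed to `pd.DataFrame(matrix, columns=columns)` in a single call.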


In [16]:
# return most common words for writers

common_words = []
for i, row in df_writer_combined_text.iterrows():
    baggedwords = bag_of_words(row['combined_text_tokenized'])
    common_words.append(baggedwords)

flat_list=[]
for sublist in common_words:
    for item in sublist:
        flat_list.append(item)
        

words_in_articles = set(flat_list)
print(words_in_articles)
print(len(words_in_articles))

4461


In [17]:
# Create our data frame with features. This can take a while to run.
word_counts = bow_features(sentences, words_in_articles)

print(word_counts.sum())
word_counts.head()

Processing row 0
Processing row 500
Processing row 1000
Processing row 1500
Processing row 2000
Processing row 2500
Processing row 3000
Processing row 3500
Processing row 4000
Processing row 4500
Processing row 5000
Processing row 5500
Processing row 6000
Processing row 6500
Processing row 7000
Processing row 7500
Processing row 8000
Processing row 8500
Series([], dtype: float64)


Unnamed: 0,operative,justice.but,shrug,psalm,bad,police,ko,outrageous,product,abraham,...,flaw,mexico,pack,lesson,hundred,unpopular,throw,norm,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(steve, bannon, may, no, longer, be, physicall...",CHARLES M. BLOW
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(leadership, ,, breitbart, has, given, favorab...",CHARLES M. BLOW
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(and, spencer, loves, it.yes, ,, that, richard...",CHARLES M. BLOW
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(that, was, the, same, protest, about, which, ...",CHARLES M. BLOW
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(i, do, nt, think, it, has, done, this, delibe...",CHARLES M. BLOW


#### tf-idf:

In [52]:
# create features with tf-idf

from sklearn.feature_extraction.text import TfidfVectorizer

# one string per article
ft_list = op_ed_articles['full_text'].tolist()

vectorizer = TfidfVectorizer(max_df=0.6, # drop words that occur in more than 60% of the articles
                             stop_words='english', 
                             use_idf=True, # weight terms by inverse document frequency
                             smooth_idf=True # adds 1 to all document frequencies, as if an extra document existed that used every word once; prevents divide-by-zero errors
                            )

# norm='l2' (the default) rescales each article's vector to unit length, so longer and shorter articles are treated equally
# Applying the vectorizer
full_text_tfidf = vectorizer.fit_transform(ft_list)
print("Number of features: %d" % full_text_tfidf.shape[1])


# Reshape the vectorizer output into something people can read
X_train_tfidf_csr = full_text_tfidf.tocsr()

# number of articles
n = X_train_tfidf_csr.shape[0]
# A list of dictionaries, one per article
tfidf_byarticle = [{} for _ in range(0,n)]
# List of features (scikit-learn >= 1.0; older versions use get_feature_names())
terms = vectorizer.get_feature_names_out().tolist()
# for each article, lists the feature words and their tf-idf scores
for i, j in zip(*X_train_tfidf_csr.nonzero()):
    tfidf_byarticle[i][terms[j]] = X_train_tfidf_csr[i, j]
    
# Only nonzero scores are stored: a word that is absent from an article has a tf-idf score of 0 for that article.
#print('Original sentence:', op_ed_articles['full_text'][2])
#print('Tf_idf vector:', tfidf_byarticle[2])
print(len(tfidf_byarticle))

Number of features: 19740
307
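
For reference, the weighting that scikit-learn documents for `smooth_idf=True` with the default `norm='l2'` is idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by scaling each document vector to unit length. The sketch below reimplements that formula with the standard library only, on a made-up two-document corpus. The function name `tfidf` and the sample tokens are illustrative assumptions; it skips tokenisation, stop words and `max_df`.

```python
import math
from collections import Counter

def tfidf(docs):
    """Smoothed tf-idf with L2-normalised rows, for already-tokenised documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    weighted = []
    for doc in docs:
        raw = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        weighted.append({t: v / norm for t, v in raw.items()})
    return idf, weighted

# Hypothetical tokenised documents.
docs = [["tax", "tax", "gun"], ["tax", "vote"]]
idf, weights = tfidf(docs)
print(idf)
print(weights[0])
```

A term that appears in every document ("tax" here) gets the minimum idf of 1, so it is down-weighted relative to rarer terms but never zeroed out.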


#### Feature Reduction

In [23]:
# reduce number of words in bow

from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Our SVD data reducer. We are going to reduce the feature space from the 4,461 bag-of-words columns to 1,000 components.
svd= TruncatedSVD(1000)
lsa = make_pipeline(svd, Normalizer(copy=False))
# Run SVD on the bag-of-words counts, then project them onto the components.
X = np.array(word_counts.drop(['text_sentence','text_source'],axis=1))
X_train_lsa_bow = lsa.fit_transform(X)

variance_explained=svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance captured by all components:",total_variance*100)

Percent variance captured by all components: 82.0058129488
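
The number of components is a free choice. One way to pick it, sketched here on synthetic data rather than the notebook's features, is to look at the cumulative explained variance and take the smallest number of components that reaches a target. The matrix `X_demo`, the 80% target and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
# Synthetic count-like matrix standing in for the bag-of-words features.
X_demo = rng.poisson(0.3, size=(200, 50)).astype(float)

svd_demo = TruncatedSVD(n_components=40, random_state=0)
svd_demo.fit(X_demo)

# Running total of the variance captured by the first k components.
cumulative = np.cumsum(svd_demo.explained_variance_ratio_)
target = 0.80
if cumulative[-1] >= target:
    # Index of the first entry that reaches the target, converted to a count.
    n_needed = int(np.searchsorted(cumulative, target) + 1)
else:
    n_needed = None  # the fitted components never reach the target
print(n_needed, cumulative[-1])
```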


### Use the features to fit supervised learning models for each feature set to predict the category outcomes.

#### Create training and testing sets


In [24]:
# for BoW dataset
Y = word_counts['text_source']
#X = np.array(word_counts.drop(['text_sentence','text_source'],axis=1))
X = X_train_lsa_bow
X_train_bow, X_test_bow, y_train_bow, y_test_bow = train_test_split(X, 
                                                    Y,stratify=Y,
                                                    test_size=0.25,
                                                    random_state=0)

#### Naive Bayes:

In [27]:
# using BoW dataset

from sklearn.naive_bayes import BernoulliNB

# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()
# Fit our model to the data.
bnb.fit(X_train_bow, y_train_bow)
y_pred = bnb.predict(X_train_bow)

print('Training set score:', bnb.score(X_train_bow, y_train_bow))
print('\nTest set score:', bnb.score(X_test_bow, y_test_bow))

Training set score: 0.583207944628

Test set score: 0.312274368231
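
One thing to keep in mind when reading this score: `BernoulliNB` models binary features, and with its default `binarize=0.0` scikit-learn thresholds every input at zero. On LSA output that means the model only sees the sign of each component. The sketch below, on synthetic data, checks that the default model and a model trained on hand-binarised features make the same predictions; all names in it are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(60, 5))          # synthetic continuous features
y_demo = (X_demo[:, 0] > 0).astype(int)    # synthetic labels

# Default behaviour: features are thresholded at 0.0 inside the model.
default_nb = BernoulliNB().fit(X_demo, y_demo)

# Equivalent: binarise by hand and switch the internal thresholding off.
X_signs = (X_demo > 0).astype(float)
manual_nb = BernoulliNB(binarize=None).fit(X_signs, y_demo)

same = np.array_equal(default_nb.predict(X_demo), manual_nb.predict(X_signs))
print(same)
```

A model for continuous inputs, such as `GaussianNB`, might be a better fit for SVD-reduced features, though that was not tested here.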


In [29]:
# using tf-idf dataset (X_train_tfidf, y_train_tfidf, etc. come from a train_test_split on the tf-idf features)
# Fit our model to the data.
bnb.fit(X_train_tfidf, y_train_tfidf)
y_pred = bnb.predict(X_train_tfidf)

print('Training set score:', bnb.score(X_train_tfidf, y_train_tfidf))
print('\nTest set score:', bnb.score(X_test_tfidf, y_test_tfidf))

Training set score: 0.978260869565

Test set score: 0.493506493506


#### Logistic Regression:

In [31]:
from sklearn.linear_model import LogisticRegression

# using BoW dataset
lr = LogisticRegression()
train = lr.fit(X_train_bow, y_train_bow)

print('Training set score:', lr.score(X_train_bow, y_train_bow))
print('\nTest set score:', lr.score(X_test_bow, y_test_bow))

Training set score: 0.605025579296

Test set score: 0.43321299639


In [33]:
# using tf-idf dataset
lr = LogisticRegression()
train = lr.fit(X_train_tfidf, y_train_tfidf)
print('Training set score:', lr.score(X_train_tfidf, y_train_tfidf))
print('\nTest set score:', lr.score(X_test_tfidf, y_test_tfidf))

Training set score: 0.934782608696

Test set score: 0.571428571429
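
The scores above come from a single train/test split. The cross-validation step of the assignment could be sketched as follows, with the vectorizer placed inside a pipeline so that each fold fits it on its own training text only. The toy corpus and labels are hypothetical and only there to make the example runnable; on the real data the inputs would be the article texts and bylines.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: two "authors" with six short texts each.
texts = [
    "tax cuts help the economy grow", "the economy needs lower tax rates",
    "markets reward tax reform", "growth follows tax relief",
    "tax policy drives the economy", "lower rates lift growth",
    "gun laws protect our children", "congress must pass gun control",
    "children deserve safe schools", "gun violence harms schools",
    "safe streets need gun laws", "control guns to protect children",
]
authors = ["A"] * 6 + ["B"] * 6

# Vectorizer and classifier are refit together on every training fold.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(model, texts, authors, cv=folds)
print(scores.mean(), scores.std())
```

Comparing the mean and spread of the fold scores for each feature set gives a steadier basis for deciding whether one model really performs better than another.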


### Pick one of the models and try to increase accuracy by at least 5 percentage points
I chose to work with the logistic regression model and the tf-idf dataset because it had the strongest performance on the test dataset, although the gap between training and test accuracy (.93 vs. .57) shows that it overfits.
1. Starting values (training/test): .93/.57
2. Added word count and unique word count as features: .88/.58
3. Reduced the number of tf-idf features with SVD (20,111 terms to 100 components in the code below): .88/.64

In [213]:
# create features with tf-idf

from sklearn.feature_extraction.text import TfidfVectorizer

# one string per article
ft_list = op_ed_articles['full_text'].tolist()

vectorizer = TfidfVectorizer(max_df=0.4, # drop words that occur in more than 40% of the articles
                             stop_words='english', 
                             use_idf=True, # weight terms by inverse document frequency
                             smooth_idf=True # adds 1 to all document frequencies, as if an extra document existed that used every word once; prevents divide-by-zero errors
                            )

# norm='l2' (the default) rescales each article's vector to unit length, so longer and shorter articles are treated equally
# Applying the vectorizer
full_text_tfidf = vectorizer.fit_transform(ft_list)
print("Number of features: %d" % full_text_tfidf.shape[1])


# Reshape the vectorizer output into something people can read
X_train_tfidf_csr = full_text_tfidf.tocsr()

# number of articles
n = X_train_tfidf_csr.shape[0]
# A list of dictionaries, one per article
tfidf_byarticle = [{} for _ in range(0,n)]
# List of features (scikit-learn >= 1.0; older versions use get_feature_names())
terms = vectorizer.get_feature_names_out().tolist()
# for each article, lists the feature words and their tf-idf scores
for i, j in zip(*X_train_tfidf_csr.nonzero()):
    tfidf_byarticle[i][terms[j]] = X_train_tfidf_csr[i, j]
    
# Only nonzero scores are stored: a word that is absent from an article has a tf-idf score of 0 for that article.
#print('Original sentence:', op_ed_articles['full_text'][2])
print('Tf_idf vector:', tfidf_byarticle[2])

Number of features: 20111
Tf_idf vector: {'federal': 0.030719553064942735, 'investigation': 0.28823770036879054, 'counterterrorism': 0.05344914344408784, 'real': 0.022574254958545651, 'law': 0.027172893638431898, 'justice': 0.030286563288530473, 'pointed': 0.035331192708502809, 'history': 0.023440581273114932, 'entire': 0.034998141785804369, 'attorney': 0.12125455031623039, 'general': 0.060152791002577762, 'chief': 0.030286563288530473, 'news': 0.022678828728163757, 'inevitably': 0.042833333251246039, 'meeting': 0.17028195732978368, 'motivated': 0.050335473793989262, 'connection': 0.040973346200979394, 'attention': 0.027999501372805085, 'possibility': 0.033471205658236171, 'decided': 0.034998141785804369, 'conspiracy': 0.084354349447836693, 'away': 0.023901216788586321, 'supporters': 0.037578536044032107, 'hillary': 0.028348670393727394, 'clinton': 0.028348670393727394, 'little': 0.046434825044399018, 'times': 0.063076294381837095, 'need': 0.044941363820407872, 'far': 0.021299453715999

In [214]:
# reduce number of features in tf-idf

from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Our SVD data reducer. We are going to reduce the feature space from the 20,111 tf-idf features to 100 components.
svd= TruncatedSVD(100)
lsa = make_pipeline(svd, Normalizer(copy=False))
# Run SVD on the training data, then project the training data.
X_train_lsa_tfidf = lsa.fit_transform(full_text_tfidf)

variance_explained=svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance captured by all components:",total_variance*100)




Percent variance captured by all components: 50.8360155988


In [215]:
# Additional features: append each article's word count to the LSA components
op_ed_articles_numwords = op_ed_articles.filter(items=['word_count'],axis=1)
X_train_lsa_tfidf=pd.DataFrame(X_train_lsa_tfidf)
data_tfidf = pd.concat([X_train_lsa_tfidf, op_ed_articles_numwords],axis=1)
data_tfidf.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,word_count
0,0.195629,-0.136996,-0.081917,-0.087015,0.034639,0.052266,0.007678,-0.123814,0.07418,0.030165,...,0.123781,0.199371,-0.064141,0.066511,-0.127108,-0.006951,0.003398,-0.057107,-0.072921,753
1,0.207118,-0.110172,-0.042748,-0.054551,0.067294,0.627555,0.191517,0.162841,-0.053989,-0.010135,...,-0.044783,-0.048705,0.070934,-0.057049,-0.035751,0.010536,0.02005,0.0193,0.013381,776
2,0.220139,-0.150532,-0.101636,-0.150315,0.698596,-0.178704,0.023079,0.030374,-0.152636,-0.139953,...,0.015931,0.021279,0.01169,0.048016,-0.047894,0.002439,-0.04479,-0.01635,0.054137,686
3,0.493741,0.312962,0.198156,0.153431,0.090953,-0.019078,-0.045009,0.049372,0.010701,0.016947,...,0.093624,-0.021245,0.067245,-0.022356,0.012096,-0.047093,-0.02654,0.1376,0.088358,528
4,0.256949,0.103302,0.012789,-0.039456,0.013495,-0.007447,0.038815,0.037205,0.122119,0.32293,...,-0.004514,0.00179,0.010287,0.008672,0.011999,0.004227,0.004104,0.005213,-0.007518,725


In [216]:
# Compute per-article metrics: number of unique words and number of punctuation marks.
unique_word_counts = []
total_punctuation_counts = []
for i, row in op_ed_articles.iterrows():
    total_words = [token for token in row['full_text_tokenized'] if not token.is_punct]
    total_punct = [token for token in row['full_text_tokenized'] if token.is_punct]
    unique_words = set([token.text for token in total_words])
    unique_wc = len(unique_words)
    punc_count = len(total_punct)
    unique_word_counts.append(unique_wc)
    total_punctuation_counts.append(punc_count)
unique_words = pd.Series(unique_word_counts)
punctuation = pd.Series(total_punctuation_counts)

data_tfidf = pd.concat([data_tfidf, unique_words],axis=1)
#data_tfidf = pd.concat([data_tfidf, punctuation],axis=1)
# renumber the columns 0..101 so that every column label is unique
data_tfidf = data_tfidf.T.reset_index(drop=True).T
data_tfidf.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,92,93,94,95,96,97,98,99,100,101
0,0.195629,-0.136996,-0.081917,-0.087015,0.034639,0.052266,0.007678,-0.123814,0.07418,0.030165,...,0.199371,-0.064141,0.066511,-0.127108,-0.006951,0.003398,-0.057107,-0.072921,753.0,391.0
1,0.207118,-0.110172,-0.042748,-0.054551,0.067294,0.627555,0.191517,0.162841,-0.053989,-0.010135,...,-0.048705,0.070934,-0.057049,-0.035751,0.010536,0.02005,0.0193,0.013381,776.0,371.0
2,0.220139,-0.150532,-0.101636,-0.150315,0.698596,-0.178704,0.023079,0.030374,-0.152636,-0.139953,...,0.021279,0.01169,0.048016,-0.047894,0.002439,-0.04479,-0.01635,0.054137,686.0,319.0
3,0.493741,0.312962,0.198156,0.153431,0.090953,-0.019078,-0.045009,0.049372,0.010701,0.016947,...,-0.021245,0.067245,-0.022356,0.012096,-0.047093,-0.02654,0.1376,0.088358,528.0,297.0
4,0.256949,0.103302,0.012789,-0.039456,0.013495,-0.007447,0.038815,0.037205,0.122119,0.32293,...,0.00179,0.010287,0.008672,0.011999,0.004227,0.004104,0.005213,-0.007518,725.0,392.0
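
In this table the LSA columns lie within [-1, 1] while the two count columns are in the hundreds. Regularised logistic regression is sensitive to feature scale, so one option (not used in the run above) would be to standardise the count columns first. A minimal sketch using the five rows shown above, with an illustrative helper `standardise`:

```python
from statistics import mean, pstdev

def standardise(values):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Word counts and unique-word counts of the first five articles in the table.
word_count = [753, 776, 686, 528, 725]
unique_words_col = [391, 371, 319, 297, 392]

word_count_z = standardise(word_count)
unique_words_z = standardise(unique_words_col)
print(word_count_z)
print(unique_words_z)
```

In scikit-learn, `StandardScaler` fitted on the training split performs the same rescaling and can be reused on the test split.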


In [217]:
# for tf-idf dataset
Y = op_ed_articles['byline']
X = data_tfidf  

X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(X, 
                                                    Y,stratify=Y,
                                                    test_size=0.25,
                                                    random_state=0)

In [218]:
# using tf-idf dataset
lr = LogisticRegression()
train = lr.fit(X_train_tfidf, y_train_tfidf)
print('Training set score:', lr.score(X_train_tfidf, y_train_tfidf))
print('\nTest set score:', lr.score(X_test_tfidf, y_test_tfidf))

Training set score: 0.878260869565

Test set score: 0.636363636364


### Conclusions:
I was able to improve the test set score from 0.57 to 0.64 (about 6.5 percentage points) by adding word count features and reducing the feature set using SVD.
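
To put the size of that gain in context, the arithmetic can be spelled out. The test-set size below is inferred from the 307 articles and `test_size=0.25` rather than printed anywhere in the notebook, so treat it as an assumption; the two scores are the reported test accuracies.

```python
from math import ceil

n_articles = 307
n_test = ceil(0.25 * n_articles)     # inferred test-set size

baseline = 0.571428571429            # reported test score before the changes
improved = 0.636363636364            # reported test score after the changes

# Convert accuracies back to counts of correctly classified test articles.
baseline_correct = round(baseline * n_test)
improved_correct = round(improved * n_test)
gain_points = (improved - baseline) * 100

print(n_test, baseline_correct, improved_correct, round(gain_points, 2))
```

Under that assumption each test article is worth about 1.3 percentage points, so the improvement corresponds to a handful of additional correct predictions; repeating the comparison with cross-validation would show how stable it is.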