### Let's understand more about TF-IDF



A tf-idf vectorization of a corpus of text documents assigns each word in a document a score that increases with the word's frequency in that document and decreases with the number of documents in which the word occurs.

Very common words, such as “a” or “the”, thereby receive heavily discounted tf-idf scores, in contrast to words that are very specific to the document in question. The result is a matrix of tf-idf scores with one row per document and as many columns as there are different words in the dataset. (Ref https://buhrmann.github.io/tfidf-analysis.html)
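To make the shape of that matrix concrete, here is a minimal sketch on a made-up three-document corpus (the corpus and the `toy_` names are assumptions for illustration, not part of the Kaggle data). It uses scikit-learn's `TfidfVectorizer` with default settings, which applies a smoothed, logarithmic idf and L2-normalizes each row.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus, for illustration only.
toy_corpus = [
    "the cat sat on a mat",
    "the dog chased a cat",
    "the committee banned russian athletes",
]

toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_corpus)

# One row per document, one column per distinct token.
# (The default tokenizer drops single-character tokens such as "a".)
print(toy_tfidf.shape)

# Scores in the first document: "the" occurs in every document,
# "cat" in two of them, and "mat" only in this one.
vocab = toy_vectorizer.vocabulary_  # maps token -> column index
first_doc = toy_tfidf[0].toarray().ravel()
for token in ("the", "cat", "mat"):
    print(token, round(float(first_doc[vocab[token]]), 3))
```

Each of the three tokens occurs once in the first document, so the difference in their scores comes entirely from how many documents contain them.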

In [3]:
import numpy as np
import pandas as pd 
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import PunktSentenceTokenizer , TreebankWordTokenizer
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score, accuracy_score , recall_score , precision_score





In [4]:
df_train = pd.read_csv('data/train_kaggle.csv', sep=',', encoding='utf-8')

In [5]:
df_train['text'].apply(type).unique()
# float most likely comes from missing values: pandas reads them as NaN, which is a float

array([<class 'str'>, <class 'float'>], dtype=object)

In [6]:
X = df_train['text'].drop(df_train[df_train['text'].apply(type) == float].index)

In [7]:
X.apply(type).unique()

array([<class 'str'>], dtype=object)

In [None]:
treebank_word_tokenize = TreebankWordTokenizer().tokenize
tokens = [treebank_word_tokenize(content.lower()) for content in X]

In [None]:
nltk.download('stopwords') # figure out a way to not download it every time?
stop_word = set(stopwords.words('english')) #only unique stop words

In [8]:
# Replace the X above if we use tokenization and other NLP preprocessing
# Drop rows containing NaN values
df_train.dropna(inplace=True)
X = df_train['text']
y = df_train['label']

# TODO: do the same thing for the headline (and author or source?)
# TODO: rename these to train / validation
X_train,  X_test,  y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=12345)

In [9]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.85, min_df=2)

In [10]:
#takes around 2-3 mins
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_validation_tfidf = tfidf_vectorizer.transform(X_test)

In [11]:
lr = LogisticRegression()
# train our model
lr.fit(X_train_tfidf, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

## Print most popular words/n-grams for each class/category

In [12]:
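#X_train_tfidf -> transformed X
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead
feature_names = tfidf_vectorizer.get_feature_names_out()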

In [13]:
clf_coef = lr.coef_[0]

In [14]:
clf_coef[:10]

array([-0.40159966, -0.36543547,  0.0009544 , -0.03230355,  0.00609635,
        0.04830219,  0.01201773,  0.02882158,  0.00061854,  0.02897177])
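The raw coefficients are hard to read on their own. As a hedged sketch (toy texts, labels and all `toy_` names are assumptions for illustration), one way to interpret them is to pair each coefficient with its feature name and sort. For a binary `LogisticRegression`, `coef_` has a single row, and positive weights push a document toward the second class in `classes_`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, for illustration only.
toy_texts = [
    "doping scandal hits olympic athletes",
    "olympic committee bans athletes for doping",
    "election emails leaked before the vote",
    "leaked emails shake the election campaign",
]
toy_labels = [0, 0, 1, 1]

toy_vec = TfidfVectorizer()
toy_X = toy_vec.fit_transform(toy_texts)
toy_clf = LogisticRegression().fit(toy_X, toy_labels)

toy_coef = toy_clf.coef_[0]  # binary problem: a single row of weights

# Rebuild the column -> token list from vocabulary_ (token -> column index).
toy_names = np.array(sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get))

order = np.argsort(toy_coef)  # ascending: most negative weights first
print("pushes toward class 0:", list(toy_names[order[:3]]))
print("pushes toward class 1:", list(toy_names[order[-3:][::-1]]))
```

The same pairing should work with `clf_coef` and `feature_names` from the cells above, since both are ordered by column index.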

We want to see the top n features (words or n-grams) for each class. Let's start by looking at which features carry the most weight in a single document. For that we can write a function that uses argsort to produce the indices that would sort the given row by tf-idf score, reverses them (descending order), and selects the top n features. Because each row of the tf-idf matrix is sparse, we also need to convert it to a dense NumPy array that argsort will accept.

In [15]:
def top_n_features_doc(tfidf_row_formated, feature_names, top_n=25):
    ''' returns the top_n (feature, tf-idf score) pairs of a dense 1-D tf-idf row, highest score first '''
    tfidf_row = tfidf_row_formated  # must already be a dense 1-D array, e.g. np.squeeze(row.toarray())

    # argsort returns the indices that sort the row in ascending order;
    # reverse them for descending order and keep the first top_n
    top_n_index = np.argsort(tfidf_row)[::-1][:top_n]
    top_feat_tuple = [(feature_names[i], tfidf_row[i]) for i in top_n_index]
    return top_feat_tuple


In [16]:
doc_no = 4
tfidf_row_formated = np.squeeze(X_train_tfidf[doc_no].toarray())
top_feat_tuple = top_n_features_doc(tfidf_row_formated, feature_names, top_n=10)

Using this, we can show which words or n-grams have the highest tf-idf scores in this document.

In [17]:
top_feat_tuple

[('smirnov', 0.27639183875467016),
 ('mr smirnov', 0.2570028427186608),
 ('russian', 0.20214192282192292),
 ('olympic', 0.18800806724843414),
 ('doping', 0.1833163107144178),
 ('athletes', 0.16785601016497628),
 ('antidoping', 0.14856183799588057),
 ('olympic committee', 0.13903913929811618),
 ('mclaren', 0.13756086910412713),
 ('mr', 0.13720973333629302)]
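Densifying a row with several hundred thousand columns just to find a handful of non-zero entries is wasteful. A possible alternative, sketched here under the assumption that the row is a 1 x n SciPy CSR matrix (the helper name `top_n_features_sparse_row` and the toy values are hypothetical), is to sort only the stored non-zero values.

```python
import numpy as np
from scipy.sparse import csr_matrix

def top_n_features_sparse_row(sparse_row, names, top_n=25):
    """Hypothetical variant of top_n_features_doc that reads a 1 x n sparse row directly."""
    row = csr_matrix(sparse_row)
    # row.data holds the non-zero scores, row.indices their column positions.
    order = np.argsort(row.data)[::-1][:top_n]
    return [(names[row.indices[i]], float(row.data[i])) for i in order]

# Toy row and feature names, for illustration only.
toy_row = csr_matrix(np.array([[0.0, 0.3, 0.0, 0.9, 0.1]]))
toy_row_names = ["mr", "russian", "said", "smirnov", "olympic"]
toy_top = top_n_features_sparse_row(toy_row, toy_row_names, top_n=2)
print(toy_top)
```

Unlike the dense version, this only returns features with a non-zero score, so it can return fewer than `top_n` items for very short documents.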

In [18]:
X_train_tfidf.shape

(14628, 743712)

In [19]:
# 12.
def top_features_all_docs(X_train_tfidf, row_ids=None, tfid_limit=0.1, top_n=25):
    ''' returns the top n features (words/n-grams) with the highest mean tf-idf score across
        all documents in our tfidf matrix if row_ids is None, else only across the rows in row_ids
    '''
    # "is not None" also works when row_ids is a NumPy array
    if row_ids is not None:
        all_docs = X_train_tfidf[row_ids].toarray()
    else:
        print("Coming in this loop")
        all_docs = X_train_tfidf.toarray()  # convert to a dense np array (memory-heavy for a large vocabulary)

    # zero out scores below the limit so they don't contribute to the mean
    all_docs[all_docs < tfid_limit] = 0

    all_docs_mean_tfidf_mean = np.mean(all_docs, axis=0)

    print("Coming here", all_docs_mean_tfidf_mean.shape)
    # pass top_n through; feature_names is the global defined earlier
    top_features = top_n_features_doc(all_docs_mean_tfidf_mean, feature_names, top_n=top_n)
    return top_features
    

In [20]:
# 8.26
print(top_features_all_docs(X_train_tfidf))

Coming in this loop
Coming here (743712,)
[('mr', 0.021452655758997248), ('trump', 0.015939575031755707), ('mr trump', 0.009288355530070494), ('clinton', 0.009029553712913552), ('ms', 0.004065133790661662), ('said', 0.004039320096436103), ('fbi', 0.0037453680090903424), ('hillary', 0.003669984100880582), ('russia', 0.0034117748323450763), ('police', 0.0033820732282593014), ('comey', 0.0032127156240300244), ('obama', 0.0030540825336704855), ('_____', 0.002968468247471249), ('mrs', 0.002627442958253107), ('mrs clinton', 0.002482043274480689), ('la', 0.002412407430076807), ('china', 0.002379252492807141), ('israel', 0.002241969853591411), ('russian', 0.002229520997511233), ('syria', 0.0020544469890717942), ('women', 0.0020179327362084337), ('emails', 0.001842932852766811), ('percent', 0.0018143032222292976), ('que', 0.0017961415038461447), ('el', 0.0017110091040987035)]


Let's now look at the whole document set to see which words are the most important across all documents. For that we can take the average tf-idf score of each word (feature/n-gram) across all documents (rows). We can also zero out low tf-idf scores first: commonly used words such as "a", "the", or "news" (as this is a news dataset) have low tf-idf values within each document, but they occur in most documents, so their many small scores can still add up to a relatively high average and push them into our top n.
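The thresholded average can also be computed without converting the whole matrix to a dense array, which matters with a vocabulary this large. The following is only a sketch under the assumption that the matrix is a SciPy sparse matrix; the helper name `mean_tfidf_above_limit` and the toy values are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

def mean_tfidf_above_limit(tfidf_matrix, tfidf_limit=0.1):
    """Column means after zeroing scores below tfidf_limit, staying sparse throughout."""
    thresholded = csr_matrix(tfidf_matrix, copy=True)  # copy so the input is untouched
    thresholded.data[thresholded.data < tfidf_limit] = 0
    thresholded.eliminate_zeros()
    # mean(axis=0) returns a 1 x n matrix; flatten it to a plain 1-D array.
    return np.asarray(thresholded.mean(axis=0)).ravel()

# Toy matrix: column 0 mimics a word with a low score in every document.
toy_matrix = csr_matrix(np.array([
    [0.05, 0.80, 0.00],
    [0.05, 0.00, 0.60],
    [0.05, 0.40, 0.00],
]))
toy_means = mean_tfidf_above_limit(toy_matrix)
print(toy_means)
```

The mean is still taken over all rows, so the result matches the dense computation; only the memory footprint differs.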

Let's also examine the top features for each class. We can filter the rows of our tfidf matrix by class label and call our top_features_all_docs function on each subset.
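A per-class breakdown could look roughly like the sketch below. It is self-contained and runs on made-up data; the helper `top_features_by_class` and all `toy_` names are assumptions for illustration, and the low-score threshold is left out to keep it short.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_features_by_class(tfidf_matrix, labels, names, top_n=3):
    """Hypothetical helper: mean tf-idf per feature within each class, top_n per class."""
    labels = np.asarray(labels)  # positions, not pandas index labels
    result = {}
    for label in np.unique(labels):
        row_ids = np.where(labels == label)[0]  # row positions for this class
        class_mean = np.asarray(tfidf_matrix[row_ids].mean(axis=0)).ravel()
        top_idx = np.argsort(class_mean)[::-1][:top_n]
        result[int(label)] = [(str(names[i]), float(class_mean[i])) for i in top_idx]
    return result

# Hypothetical toy data, for illustration only.
toy_texts = [
    "doping scandal hits olympic athletes",
    "olympic committee bans athletes for doping",
    "election emails leaked before vote",
    "leaked emails shake election campaign",
]
toy_labels = [0, 0, 1, 1]

toy_vec = TfidfVectorizer()
toy_X = toy_vec.fit_transform(toy_texts)
toy_names = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)

toy_by_class = top_features_by_class(toy_X, toy_labels, toy_names)
for label, feats in toy_by_class.items():
    print(label, [name for name, _ in feats])
```

With the notebook's data, the labels would need to be passed as row positions (for example `y_train.values`), because after `train_test_split` the pandas index of `y_train` no longer lines up with the row numbers of `X_train_tfidf`.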