# NLP: News

News aims to revolutionize the way Indians perceive finance, business, and
capital market investment by applying artificial intelligence (AI) and
machine learning (ML). The goal of this project is to take news articles
extracted from the company's internal database and classify them, based on
their content, into five categories: politics, technology, sports, business,
and entertainment.

# 1.0 Import Data & Explore Characteristics

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
cd /content/drive/MyDrive/flip

/content/drive/MyDrive/flip


In [None]:
import numpy as np
import pandas as pd
import re
import string

from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
df= pd.read_csv("News-data.csv")
df.head()

Unnamed: 0,Category,Article
0,Technology,tv future in the hands of viewers with home th...
1,Business,worldcom boss left books alone former worldc...
2,Sports,tigers wary of farrell gamble leicester say ...
3,Sports,yeading face newcastle in fa cup premiership s...
4,Entertainment,ocean s twelve raids box office ocean s twelve...


# 2.0 Exploratory Data Analysis

In [None]:
# Column names, dtypes, and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  2225 non-null   object
 1   Article   2225 non-null   object
dtypes: object(2)
memory usage: 34.9+ KB


In [None]:
df.shape

(2225, 2)

In [None]:
#News articles per category
df['Category'].value_counts()

Sports           511
Business         510
Politics         417
Technology       401
Entertainment    386
Name: Category, dtype: int64

# 3. Process Textual Data

In [None]:
def clean_text(text):
  '''Lowercase, strip punctuation and non-letters, remove stop words, lemmatize.

  Returns a list of lemmatized tokens.'''

  # convert to lower case
  text = text.lower()

  # remove punctuation
  text = text.translate(str.maketrans('', '', string.punctuation))

  # replace remaining special characters and digits with spaces, then collapse whitespace
  # (raw strings avoid the invalid-escape warning for '\s')
  text = re.sub(r'[^a-zA-Z]', ' ', text)
  text = re.sub(r'\s+', ' ', text)

  # tokenize text
  word_tokens = nltk.word_tokenize(text)

  # remove stop words (a set gives constant-time membership tests)
  stop_words = set(stopwords.words('english'))
  clean_wd = [word for word in word_tokens if word not in stop_words]

  # lemmatize words
  lemmatizer = WordNetLemmatizer()
  words = [lemmatizer.lemmatize(word) for word in clean_wd]

  return words
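
The cleaned output contains fused tokens such as `highdefinition` and `talkedabout`, because `clean_text` deletes punctuation instead of replacing it with a space. The sketch below contrasts the two behaviours using only the standard library. `strip_punct`, `space_punct`, and `sample` are hypothetical names for illustration and are not part of the notebook; whether hyphenated words should be split is a modelling choice, not something the results here decide.

```python
import re
import string

def strip_punct(text):
    """Delete punctuation outright, as clean_text does."""
    return text.translate(str.maketrans('', '', string.punctuation))

def space_punct(text):
    """Replace each punctuation character with a space, then collapse whitespace."""
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return re.sub(r'\s+', ' ', text.translate(table)).strip()

sample = "plasma high-definition tvs and the most talked-about set-top boxes"

print(strip_punct(sample).split())  # hyphenated words are fused into one token
print(space_punct(sample).split())  # hyphenated words become separate tokens
```
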


In [None]:
df["cleaned"] = df["Article"].apply(clean_text)
df.head()

Unnamed: 0,Category,Article,cleaned
0,Technology,tv future in the hands of viewers with home th...,"[tv, future, hand, viewer, home, theatre, syst..."
1,Business,worldcom boss left books alone former worldc...,"[worldcom, bos, left, book, alone, former, wor..."
2,Sports,tigers wary of farrell gamble leicester say ...,"[tiger, wary, farrell, gamble, leicester, say,..."
3,Sports,yeading face newcastle in fa cup premiership s...,"[yeading, face, newcastle, fa, cup, premiershi..."
4,Entertainment,ocean s twelve raids box office ocean s twelve...,"[ocean, twelve, raid, box, office, ocean, twel..."


In [None]:
df['Article'][0]

'tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to high

In [None]:
df['cleaned'][0]

['tv',
 'future',
 'hand',
 'viewer',
 'home',
 'theatre',
 'system',
 'plasma',
 'highdefinition',
 'tv',
 'digital',
 'video',
 'recorder',
 'moving',
 'living',
 'room',
 'way',
 'people',
 'watch',
 'tv',
 'radically',
 'different',
 'five',
 'year',
 'time',
 'according',
 'expert',
 'panel',
 'gathered',
 'annual',
 'consumer',
 'electronics',
 'show',
 'la',
 'vega',
 'discus',
 'new',
 'technology',
 'impact',
 'one',
 'favourite',
 'pastime',
 'u',
 'leading',
 'trend',
 'programme',
 'content',
 'delivered',
 'viewer',
 'via',
 'home',
 'network',
 'cable',
 'satellite',
 'telecom',
 'company',
 'broadband',
 'service',
 'provider',
 'front',
 'room',
 'portable',
 'device',
 'one',
 'talkedabout',
 'technology',
 'ce',
 'digital',
 'personal',
 'video',
 'recorder',
 'dvr',
 'pvr',
 'settop',
 'box',
 'like',
 'u',
 'tivo',
 'uk',
 'sky',
 'system',
 'allow',
 'people',
 'record',
 'store',
 'play',
 'pause',
 'forward',
 'wind',
 'tv',
 'programme',
 'want',
 'essentially',

# 4. Encode and Transform data

In [None]:
le = preprocessing.LabelEncoder()

# Encode the target column 'Category' as integer labels.
df['Category'] = le.fit_transform(df['Category'])

#https://stackoverflow.com/questions/42196589/any-way-to-get-mappings-of-a-label-encoder-in-python-pandas
le_map = dict(zip(le.classes_, le.transform(le.classes_)))
print(le_map)

{'Business': 0, 'Entertainment': 1, 'Politics': 2, 'Sports': 3, 'Technology': 4}
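
Once labels are encoded as integers, model predictions come back as integers too. A minimal sketch of mapping them back to category names with `LabelEncoder.inverse_transform` follows. The `encoder` object and the `predicted_ids` values are made up for illustration; in the notebook the fitted `le` object would play the same role.

```python
from sklearn.preprocessing import LabelEncoder

# Category names taken from the mapping printed above
categories = ['Business', 'Entertainment', 'Politics', 'Sports', 'Technology']

encoder = LabelEncoder()
encoder.fit(categories)  # classes are stored in sorted order

# Hypothetical model output: integer class ids
predicted_ids = [3, 0, 4]
predicted_names = encoder.inverse_transform(predicted_ids)
print(list(predicted_names))
```
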


# Create text features with BoW and TF-IDF

In [None]:
def make_vect(text, kind):
  '''Create a feature matrix using bag-of-words ('bow') or TF-IDF ('tfidf').'''
  if kind == 'bow':
    vectorizer = CountVectorizer()
  elif kind == 'tfidf':
    #https://www.analyticsvidhya.com/blog/2021/09/creating-a-movie-reviews-classifier-using-tf-idf-in-python/
    vectorizer = TfidfVectorizer()
  else:
    # fail loudly instead of falling through to an undefined result
    raise ValueError(f"invalid vectorizer type: {kind!r} (expected 'bow' or 'tfidf')")

  # each row is a token list; astype(str) turns it into a single string for the vectorizer
  return vectorizer.fit_transform(text.astype(str))

In [None]:
result = make_vect(df['cleaned'],'tfidf')
result

<2225x27154 sparse matrix of type '<class 'numpy.float64'>'
	with 326439 stored elements in Compressed Sparse Row format>
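
To make the TF-IDF weighting concrete, here is a small hand computation on a toy corpus. It follows the formula scikit-learn documents for its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each document vector. The three toy documents and the names `toy_docs`, `doc_freq`, and `tfidf` are assumptions for illustration only.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (hypothetical)
toy_docs = [
    ["market", "share", "rise"],
    ["market", "match", "win"],
    ["match", "win", "win"],
]

n_docs = len(toy_docs)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
       for term, count in doc_freq.items()}

def tfidf(doc):
    """Return L2-normalised TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    raw = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {term: v / norm for term, v in raw.items()}

for doc in toy_docs:
    print({term: round(weight, 3) for term, weight in tfidf(doc).items()})
```

In the first document, `share` appears in one document and `market` in two, so `share` receives the larger weight even though both occur once.
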

In [None]:
#https://saturncloud.io/blog/how-to-use-sklearn-traintestsplit-on-pandas-stratify-by-multiple-columns/
# Define the features and target
X = result
y = df['Category']

# Split the data into training and testing sets (75:25), stratified by category
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42,
    stratify=y
)

In [None]:
X_train.shape,X_test.shape

((1668, 27154), (557, 27154))
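
In the cells above, the TF-IDF vocabulary and IDF weights are fitted on all 2,225 articles before the split, so the test articles influence the features. One possible variant splits the raw text first and places the vectorizer inside a `Pipeline`, so that it is fitted on the training portion only. The sketch below shows the idea on a toy corpus. The corpus, the `toy_*` names, and the two-class setup are assumptions for illustration; the scores reported in this notebook were not produced this way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in corpus (hypothetical); the notebook would use the cleaned articles
toy_texts = [
    "shares rise as market rallies", "bank profit beats forecast",
    "company reports record sales", "oil price lifts stock market",
    "striker scores late winner", "team wins cup final",
    "coach praises young players", "club signs new striker",
]
toy_labels = ["Business"] * 4 + ["Sports"] * 4

# Split the raw text first, stratified by label
toy_train, toy_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The vectorizer is a pipeline step, so fit() only ever sees training text
toy_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
toy_model.fit(toy_train, toy_y_train)

toy_pred = toy_model.predict(toy_test)
print(list(zip(toy_test, toy_pred)))
```
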

# 5. Model Training & Evaluation

In [None]:
#naive bayes classifier
naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit( X_train,  y_train )

In [None]:
#predicted y
y_pred = naive_bayes_classifier.predict(X_test)

print("Classification matrix:")
print(metrics.classification_report( y_test, y_pred,target_names=le_map.keys()))

print("Confusion matrix:")
print(metrics.confusion_matrix(y_test, y_pred))

print('f1_score:',metrics.f1_score(y_test, y_pred, average="macro"))

Classification matrix:
               precision    recall  f1-score   support

     Business       0.97      0.98      0.97       128
Entertainment       1.00      0.92      0.96        97
     Politics       0.92      0.98      0.95       104
       Sports       0.99      1.00      1.00       128
   Technology       0.98      0.97      0.97       100

     accuracy                           0.97       557
    macro avg       0.97      0.97      0.97       557
 weighted avg       0.97      0.97      0.97       557

Confusion matrix:
[[125   0   2   0   1]
 [  1  89   6   0   1]
 [  1   0 102   1   0]
 [  0   0   0 128   0]
 [  2   0   1   0  97]]
f1_score: 0.9699144847608379
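
The per-class figures and the macro F1 score can be recomputed directly from the confusion matrix, which shows how the numbers relate. The sketch below uses plain Python on the Naive Bayes matrix printed above, with rows as true classes and columns as predicted classes. The variable names are illustrative.

```python
# Naive Bayes confusion matrix from the output above
# (rows: true class, columns: predicted class)
confusion = [
    [125, 0, 2, 0, 1],
    [1, 89, 6, 0, 1],
    [1, 0, 102, 1, 0],
    [0, 0, 0, 128, 0],
    [2, 0, 1, 0, 97],
]
names = ['Business', 'Entertainment', 'Politics', 'Sports', 'Technology']

f1_scores = []
for i, name in enumerate(names):
    true_pos = confusion[i][i]
    actual = sum(confusion[i])                    # row sum = support
    predicted = sum(row[i] for row in confusion)  # column sum
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    f1_scores.append(f1)
    print(f"{name:<14} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Macro averaging: unweighted mean of the per-class F1 scores
macro_f1 = sum(f1_scores) / len(f1_scores)
print(f"macro F1 = {macro_f1:.4f}")
```

Macro averaging gives every category the same weight regardless of its support, which suits a dataset where the category sizes range from 386 to 511 articles.
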


# Model Function

In [None]:
def data_model(model, X_train, y_train, X_test, y_test):
  '''Fit a classifier and print its classification report, confusion matrix, accuracy, and macro F1.'''
  print('classifier:', model)
  print('='*30)

  # wrap the estimator in a single-step pipeline
  pipe = Pipeline([("clf", model)])

  # fit the model on the training data
  pipe.fit(X_train, y_train)

  # predict labels for the test data
  y_pred = pipe.predict(X_test)

  print("Classification matrix:")
  print(metrics.classification_report(y_test, y_pred, target_names=le_map.keys()))

  print("Confusion matrix:")
  print(metrics.confusion_matrix(y_test, y_pred))

  accuracy = pipe.score(X_test, y_test)
  print("Accuracy: %.2f%%" % (accuracy*100.0))

  #https://stackoverflow.com/questions/31421413/how-to-compute-precision-recall-accuracy-and-f1-score-for-the-multiclass-case
  print('f1_score:', metrics.f1_score(y_test, y_pred, average="macro"))



In [None]:
data_model(KNeighborsClassifier(n_neighbors=3),X_train, y_train,X_test, y_test)

classifier: KNeighborsClassifier(n_neighbors=3)
Classification matrix:
               precision    recall  f1-score   support

     Business       0.89      0.91      0.90       128
Entertainment       0.97      0.88      0.92        97
     Politics       0.84      0.92      0.88       104
       Sports       0.98      0.96      0.97       128
   Technology       0.92      0.91      0.91       100

     accuracy                           0.92       557
    macro avg       0.92      0.92      0.92       557
 weighted avg       0.92      0.92      0.92       557

Confusion matrix:
[[116   0   9   1   2]
 [  3  85   4   0   5]
 [  4   2  96   1   1]
 [  5   0   0 123   0]
 [  3   1   5   0  91]]
Accuracy: 91.74%
f1_score: 0.9164621279515656


In [None]:
data_model(DecisionTreeClassifier(),X_train, y_train,X_test, y_test)

classifier: DecisionTreeClassifier()
Classification matrix:
               precision    recall  f1-score   support

     Business       0.82      0.78      0.80       128
Entertainment       0.91      0.86      0.88        97
     Politics       0.77      0.79      0.78       104
       Sports       0.88      0.94      0.91       128
   Technology       0.83      0.85      0.84       100

     accuracy                           0.84       557
    macro avg       0.84      0.84      0.84       557
 weighted avg       0.84      0.84      0.84       557

Confusion matrix:
[[100   2  16   4   6]
 [  3  83   6   1   4]
 [  7   2  82   8   5]
 [  3   3   0 120   2]
 [  9   1   2   3  85]]
Accuracy: 84.38%
f1_score: 0.8429212343726775


In [None]:
data_model(RandomForestClassifier(),X_train, y_train,X_test, y_test)

classifier: RandomForestClassifier()
Classification matrix:
               precision    recall  f1-score   support

     Business       0.94      0.98      0.96       128
Entertainment       0.97      0.95      0.96        97
     Politics       0.98      0.94      0.96       104
       Sports       0.98      1.00      0.99       128
   Technology       0.97      0.95      0.96       100

     accuracy                           0.97       557
    macro avg       0.97      0.96      0.96       557
 weighted avg       0.97      0.97      0.97       557

Confusion matrix:
[[125   0   2   0   1]
 [  4  92   0   1   0]
 [  1   1  98   2   2]
 [  0   0   0 128   0]
 [  3   2   0   0  95]]
Accuracy: 96.59%
f1_score: 0.9649970002404078


The performance of the RandomForest model could likely be improved further by tuning its hyperparameters.
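
One common way to tune a random forest is a cross-validated grid search. The sketch below runs `GridSearchCV` with macro F1 as the scoring metric on synthetic data so that it is self-contained. The data, the `*_demo` names, and the grid values are assumptions for illustration, not tuned recommendations. In the notebook, `X_train` and `y_train` would take the place of the synthetic arrays.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical); the notebook would use X_train, y_train
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_classes=3, random_state=42
)

# Illustrative grid; the values are assumptions, not recommendations
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1_macro",  # same metric used to compare models in this notebook
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```
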

# Conclusion

Macro-averaged F1 scores of the models on the test set:

|Model |F1 Score|
|-----|-----:|
|Naive Bayes |0.9699|
|KNeighborsClassifier|0.9165|
|DecisionTreeClassifier |0.8429|
|RandomForestClassifier|0.9650|

The best model is Naive Bayes, with a macro F1 score of 0.9699.



# Questionnaire:
1. How many news articles are present in the dataset that we have?   
2225 entries    

2. Most of the news articles are from <u>Sports </u>   category      
3. Only <u>401</u> no. of articles belong to the ‘Technology’ category   
4. What are Stop Words and why should they be removed from the
text data?    
Stop words are very common words in a language, such as "the", "is", and "and", that carry little topical meaning. Removing them shrinks the vocabulary and lets the model focus on the words that distinguish one category from another.  

5. Explain the difference between Stemming and Lemmatization.   
Stemming truncates word endings with heuristic rules, without considering linguistic context, so it is computationally faster but can produce non-words. Lemmatization analyzes word forms to determine the base (dictionary) form, which takes more processing time.    

6. Which of the techniques Bag of Words or TF-IDF is considered to
be more efficient than the other?    
TF-IDF is considered more efficient, since it gives more weight to rare, informative words and down-weights words that are common across documents.   

7. What’s the shape of train & test data sets after performing a
75:25 split?   
X_train.shape = (1668, 27154)     
X_test.shape = (557, 27154)    
8. Which of the following is found to be the best performing model?   
   a. Random Forest   
   b. Nearest Neighbors   
   c. Naive Bayes   

   The best model is (c) Naive Bayes, with F1 score = 0.9699.     

9. According to this particular use case, both precision and recall
are equally important     
True   


