## ML model to predict the type of article

1. The dataset is provided; please download it from this hyperlink: articles.csv. Look carefully at the dataset information.

Apply the most suitable ML algorithm and build a text classification model.

The following tasks need to be completed while building the ML model:

(i) Clean all the records in “Article_Description” and “Full_Article”, i.e. remove the HTML tags.
(ii) Merge the columns “Heading”, “Article_Description” and “Full_Article”, separated by a space, and place the merged text in a new column named “Preprocessed_Text”.
(iii) Remove stopwords and punctuation from all the records in the column “Preprocessed_Text”.
(iv) Remove leading and trailing whitespace from all the records in the column “Preprocessed_Text”.
(v) Apply feature engineering.
(vi) Build a classifier model.
(vii) Save and reload the model from disk.
(viii) Evaluate the model.
(ix) Predict the article types for unknown_articles.csv.

In [5]:
import pandas as pd
import re
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
import sklearn
from sklearn.ensemble import RandomForestClassifier

Uncomment and run the cell below if the NLTK stopwords and WordNet corpora are not available.

In [6]:
# import nltk
# nltk.download('stopwords') 
# nltk.download('wordnet')

In [7]:
data = pd.read_csv("articles.csv",encoding="latin")
data.head(1)

Unnamed: 0,Id,Heading,Article.Banner.Image,Outlets,Article.Description,Full_Article,Article_Type,Tonality
0,d6995462-5e87-453b-b64d-e9f1df6e94d2,"A Puzzling Maneuver, Then Freefall: NTSB Repor...",,Essex Caller,<p>The helicopter that crashed in Southeast Al...,<p>The helicopter that crashed in Southeast Al...,Commercial,Negative


In [8]:
data.describe(include='all')

Unnamed: 0,Id,Heading,Article.Banner.Image,Outlets,Article.Description,Full_Article,Article_Type,Tonality
count,4305,4305,1753,4305,4305,4305,4305,3873
unique,4305,4020,1686,1762,4291,4305,7,3
top,055ce4d9-a547-44c5-9aee-284eb9d3b7ad,Boeing CEO: First Operational Self-Flying Cars...,https://cdn.aviationtoday.com/wp-content/uploa...,WeChat,<p>Airbus Helicopters has delivered the first ...,<p>US Marine wing support squadrons have a res...,Commercial,Positive
freq,1,8,3,208,2,1,2470,3286


In [9]:
data["Article.Description"][0]

'<p>The helicopter that crashed in Southeast Alaska in late September, killing three people, entered a 500-foot freefall before dropping to a Glacier Bay National Park beach, according to by the National Transportation Safety Board. The preliminary NTSB report released Friday offers no official probable cause. That determination won&lsquo;t be made until next year at the earliest.</p>'

In [10]:
data["Full_Article"][0]

'<p>The helicopter that crashed in Southeast Alaska in late September, killing three people, entered a 500-foot freefall before dropping to a Glacier Bay National Park beach, according to by the National Transportation Safety Board.&nbsp;The preliminary NTSB report released Friday offers no official probable cause. That determination won&lsquo;t be made until next year at the earliest.</p>'

In [11]:
def remove_tags(string):
    # Non-greedy match removes each <...> tag; HTML entities such as &nbsp; are left untouched
    result = re.sub(r'<.*?>', '', string)
    return result
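
The sample records printed earlier still contain HTML entities such as `&nbsp;` and `&lsquo;`, which a tag-only regex does not touch. A minimal sketch, assuming the entities should be decoded as well; `clean_html` is a hypothetical helper name and not part of the original notebook:

```python
import html
import re

TAG_RE = re.compile(r"<.*?>", flags=re.DOTALL)

def clean_html(text):
    """Strip HTML tags, decode entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", text)      # replace each tag with a space
    decoded = html.unescape(no_tags)     # &nbsp; -> U+00A0, &lsquo; -> U+2018
    # str.split() treats U+00A0 as whitespace, so it is normalised here too
    return " ".join(decoded.split())

sample = "<p>Safety Board.&nbsp;The report won&lsquo;t be final.</p>"
print(clean_html(sample))
```

Replacing tags with a space rather than an empty string keeps words on either side of a tag from being glued together.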

(i) Clean all the records in “Article_Description” and “Full_Article”, i.e. remove the HTML tags.

In [12]:
data["Article.Description"] = data["Article.Description"].apply(remove_tags)
data["Full_Article"] = data["Full_Article"].apply(remove_tags)
data.head(1)

Unnamed: 0,Id,Heading,Article.Banner.Image,Outlets,Article.Description,Full_Article,Article_Type,Tonality
0,d6995462-5e87-453b-b64d-e9f1df6e94d2,"A Puzzling Maneuver, Then Freefall: NTSB Repor...",,Essex Caller,The helicopter that crashed in Southeast Alask...,The helicopter that crashed in Southeast Alask...,Commercial,Negative


(ii) Merge the columns “Heading”, “Article_Description” and “Full_Article”, separated by a space, and place the merged text in a new column named “Preprocessed_Text”.

In [13]:
data["Preprocessed_Text"] = data["Heading"]+" "+data["Article.Description"]+" "+data["Full_Article"]
data.head(1)

Unnamed: 0,Id,Heading,Article.Banner.Image,Outlets,Article.Description,Full_Article,Article_Type,Tonality,Preprocessed_Text
0,d6995462-5e87-453b-b64d-e9f1df6e94d2,"A Puzzling Maneuver, Then Freefall: NTSB Repor...",,Essex Caller,The helicopter that crashed in Southeast Alask...,The helicopter that crashed in Southeast Alask...,Commercial,Negative,"A Puzzling Maneuver, Then Freefall: NTSB Repor..."


(iii) Remove stopwords and punctuation from all the records in the column “Preprocessed_Text”

In [14]:
# Build the shared resources once instead of on every call
tokeniser = RegexpTokenizer(r'\w+')
lemmatiser = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_(text):
    """Return lowercased, lemmatised tokens without stopwords or punctuation."""
    # Tokenise words while ignoring punctuation
    tokens = tokeniser.tokenize(text)

    # Lowercase and lemmatise (each token is treated as a verb)
    lemmas = [lemmatiser.lemmatize(token.lower(), pos='v') for token in tokens]

    # Remove stopwords
    return [lemma for lemma in lemmas if lemma not in stop_words]

def preprocess_text(text):
    """Same as preprocess_, but joins the tokens back into a single string."""
    return " ".join(preprocess_(text))

In [15]:
data["Preprocessed_Text"] = data["Preprocessed_Text"].apply(preprocess_text)
data.head(1)

Unnamed: 0,Id,Heading,Article.Banner.Image,Outlets,Article.Description,Full_Article,Article_Type,Tonality,Preprocessed_Text
0,d6995462-5e87-453b-b64d-e9f1df6e94d2,"A Puzzling Maneuver, Then Freefall: NTSB Repor...",,Essex Caller,The helicopter that crashed in Southeast Alask...,The helicopter that crashed in Southeast Alask...,Commercial,Negative,puzzle maneuver freefall ntsb report provide n...


(iv) Remove leading and trailing whitespace from all the records in the column “Preprocessed_Text”

In [16]:
data["Preprocessed_Text"] = data["Preprocessed_Text"].str.strip()
data.head(1)

Unnamed: 0,Id,Heading,Article.Banner.Image,Outlets,Article.Description,Full_Article,Article_Type,Tonality,Preprocessed_Text
0,d6995462-5e87-453b-b64d-e9f1df6e94d2,"A Puzzling Maneuver, Then Freefall: NTSB Repor...",,Essex Caller,The helicopter that crashed in Southeast Alask...,The helicopter that crashed in Southeast Alask...,Commercial,Negative,puzzle maneuver freefall ntsb report provide n...


(v) Apply Feature Engineering

In [17]:
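# TF-IDF features; preprocess_ is reused as the analyzer, so the already cleaned text is tokenised the same way
vectoriser = TfidfVectorizer(analyzer=preprocess_)
X_train = vectoriser.fit_transform(data['Preprocessed_Text'])
# NOTE: the target here is the one-hot encoded "Tonality" column, which has missing values.
# The task asks for the article type, for which the target would be data["Article_Type"].
Y_train = pd.get_dummies(data["Tonality"])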
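
To make the feature-engineering step concrete, here is a small sketch of how a fitted `TfidfVectorizer` maps text to columns. The two training sentences are made-up stand-ins for rows of “Preprocessed_Text”, and the default analyzer is used rather than the notebook's `preprocess_`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-preprocessed documents
train_docs = ["airline order new jet", "army test combat helicopter"]

vectoriser = TfidfVectorizer()
X_train = vectoriser.fit_transform(train_docs)   # learns vocabulary and IDF weights

# New text must go through transform(), not fit_transform(),
# so that its columns line up with the training matrix
X_new = vectoriser.transform(["helicopter crash near airport"])

print(X_train.shape)   # one column per distinct training term
print(X_new.nnz)       # terms unseen during fit are silently dropped
```

Because unseen terms are dropped, the same fitted vectoriser has to be kept alongside the classifier for any later prediction.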

(vi) Build a classifier model 

In [18]:
clf = RandomForestClassifier(n_jobs=2, random_state=0)
clf.fit(X_train, Y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

(vii) Save and reload the model from disk

In [19]:
import pickle

# Context managers make sure the file handles are closed
with open("rf.pickle", 'wb') as f:
    pickle.dump(clf, f)
with open("rf.pickle", 'rb') as f:
    clf_loaded = pickle.load(f)
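
A pickled classifier on its own cannot score raw text, because the fitted TF-IDF vectoriser is also needed at prediction time. One possible approach, sketched here with toy data, is to wrap both in a scikit-learn `Pipeline` and persist that single object. The texts, the `Military` label, and the file name `article_model.pickle` are assumptions for illustration:

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical preprocessed texts and labels
texts = ["airline order new jet", "army test combat helicopter",
         "carrier add route fleet", "navy deploy fighter squadron"]
labels = ["Commercial", "Military", "Commercial", "Military"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])
model.fit(texts, labels)

# Save and reload the whole pipeline (vectoriser + classifier)
model_path = Path(tempfile.mkdtemp()) / "article_model.pickle"
with open(model_path, "wb") as fh:
    pickle.dump(model, fh)
with open(model_path, "rb") as fh:
    reloaded = pickle.load(fh)

# The reloaded pipeline should reproduce the original predictions
same = list(reloaded.predict(texts)) == list(model.predict(texts))
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.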

(viii) Evaluate the model

In [20]:
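from sklearn.metrics import accuracy_score
# This scores the model on its own training data, so the estimate is optimistic
preds = clf_loaded.predict(X_train)
accuracy_score(Y_train, preds)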

0.9858304297328687
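
The score above is measured on the same rows the model was trained on, which tends to overstate how well it generalises. A hedged sketch of a held-out evaluation follows; the toy texts and the `Military` label are hypothetical, and with the real data the inputs would be “Preprocessed_Text” and “Article_Type”:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical preprocessed texts and labels
texts = ["airline order new jet", "carrier add route fleet",
         "airline report profit passenger", "airport open new terminal",
         "army test combat helicopter", "navy deploy fighter squadron",
         "air force upgrade radar missile", "army order armour vehicle"]
labels = ["Commercial"] * 4 + ["Military"] * 4

# Keep 25% of the rows aside; stratify to preserve the class balance
train_x, test_x, train_y, test_y = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(train_x, train_y)          # the vectoriser sees training rows only

preds = model.predict(test_x)
held_out_accuracy = accuracy_score(test_y, preds)
print(held_out_accuracy)
print(classification_report(test_y, preds, zero_division=0))
```

With one class making up 2470 of the 4305 articles, the per-class precision and recall in the report are likely to be more informative than accuracy alone.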

(ix) Predict the article types for unknown_articles.csv

In [21]:
# unknown_articles.csv has only URLs, so here is the pseudocode (time constraints):
# 1. Iterate over the URL column
# 2. Send an HTTP request to retrieve each website's content
# 3. Use Beautiful Soup to extract the raw text
# 4. Segregate the text
# 5. Run the ML process: cleaning, feature extraction, then apply the trained model
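
The pseudocode above could be fleshed out roughly as follows. This is only a sketch under several assumptions: the standard library's `HTMLParser` and `urlopen` are swapped in for Beautiful Soup and an HTTP client to keep it dependency-free, `model` is assumed to be a fitted pipeline that accepts raw strings, and the helper names (`html_to_text`, `fetch_html`, `predict_article_types`) are hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(page_html):
    """Step 3: extract the raw text from an HTML page."""
    parser = TextExtractor()
    parser.feed(page_html)
    parser.close()
    return " ".join(parser.parts)


def fetch_html(url, timeout=10):
    """Step 2: download a page with an HTTP GET request (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def predict_article_types(urls, model, preprocess, fetch=fetch_html):
    """Steps 1 and 5: fetch, clean, and classify each URL.

    `preprocess` is assumed to be a text-cleaning function such as the
    notebook's preprocess_text; `fetch` can be replaced for offline testing.
    """
    texts = [preprocess(html_to_text(fetch(url))) for url in urls]
    return list(model.predict(texts))


demo = ("<html><head><style>p{}</style></head>"
        "<body><h1>Title</h1><p>Body text.</p></body></html>")
print(html_to_text(demo))
```

With the real file, `urls` would come from whichever column of unknown_articles.csv holds the links; that column name is not shown in the notebook.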