In [None]:
__author__ = "Pierre-Habté Nouvellon"
__email__ = "pierrehabte@snipfeed.co"
__status__ = "Challenge NLP"

# Intro:

One of the big challenge at Snipfeed is the NLP part. In fact, we have to develop a system that is capable of: 
- (i) recommending articles/videos to people based on their interaction
- (ii) giving context to an article (that is give content related to an article to allow the user to go deeper into a subject)
- (iii) giving the most relevant content following query (search engine).
- (iv) avoiding fake news
- (v) extracting the most important information from an article to transform it into a snip


The understanding and implementation of classic machine learning / deeplearning techniques is a requirement. This test is made to see the candidate's ability to learn new technics and apply it on a concrete examples.

# Fake news challenge

In this task you will have to build a ML model that gives a confidence score between 1 and 0 to an article (is this article reliable or not). The requirements are:

- train you model on the Kaggle fake news dataset https://www.kaggle.com/c/fake-news/data .
- provide the metrics to assess you model on the testing set (accuracy etc)
- build a final function give_score(link) that takes a link of an article as an input and outputs the confidence score. 
- compare 3 different ML technics. Parameter tuning is very important, so explain each step of the process in the notebook.

In [25]:
%load_ext autoreload
%reload_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [65]:
import pandas as pd
import numpy as np

train_set = pd.read_csv("/data/train.csv", index_col=0)
train_set = train_set.fillna(" ")

train_set.head()

Unnamed: 0_level_0,title,author,text,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [66]:
train_set["total"] = train_set["title"] + " " + train_set["author"] + train_set["text"]

In [67]:
"""Preping the data"""
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import cross_val_score

In [68]:
count_vectorizer = CountVectorizer(ngram_range=(1,2))
transformer = TfidfTransformer()
counts = count_vectorizer.fit_transform(train_set["total"].values)
tfidf = transformer.fit_transform(counts)

In [69]:
targets = train_set["label"].values
features = tfidf

In [6]:
from sklearn.ensemble import RandomForestClassifier
random_forest_classifier = RandomForestClassifier(n_estimators=10, n_jobs=-1)
#random_forest_classifier.fit(features, targets)
accuracies = cross_val_score(estimator=random_forest_classifier, X=features, y=targets, cv=10)
mean_accuracy = accuracies.mean()
print(f"Cross validation mean accuracy: {mean_accuracy}")

Cross validation mean accuracy: 0.8554792000962712


In [8]:
from sklearn.ensemble import AdaBoostClassifier
ada_boost_classifier = AdaBoostClassifier()
accuracies = cross_val_score(estimator=ada_boost_classifier, X=features, y=targets, cv=10, n_jobs=-1)
mean_accuracy = accuracies.mean()
print(f"Cross validation mean accuracy: {mean_accuracy}")

Cross validation mean accuracy: 0.9728357258408916


In [15]:
from sklearn.ensemble import GradientBoostingClassifier
grad_boost_classifier = GradientBoostingClassifier(n_estimators=10)
accuracies = cross_val_score(estimator=grad_boost_classifier, X=features, y=targets, cv=10, n_jobs=-1)
mean_accuracy = accuracies.mean()
print(f"Cross validation mean accuracy: {mean_accuracy}")

Cross validation mean accuracy: 0.943028376030238


In [101]:
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors=10, weights="distance", n_jobs=-1)
accuracies = cross_val_score(estimator=knn_classifier, X=features, y=targets, cv=10, n_jobs=-1)
mean_accuracy = accuracies.mean()
print(f"Cross validation mean accuracy: {mean_accuracy}")

Cross validation mean accuracy: 0.7810553964518404


In [109]:
from sklearn.ensemble import ExtraTreesClassifier
extra_trees_classifier = ExtraTreesClassifier(n_estimators=100, n_jobs=-1)
accuracies = cross_val_score(estimator=extra_trees_classifier, X=features, y=targets, cv=10, n_jobs=-1)
mean_accuracy = accuracies.mean()
print(f"Cross validation mean accuracy: {mean_accuracy}")

Cross validation mean accuracy: 0.9215873134887431


In [70]:
from sklearn.ensemble import AdaBoostClassifier

model = AdaBoostClassifier()
model = model.fit(features, targets)
#count_vectorizer = CountVectorizer(ngram_range=(1,2))
#transformer = TfidfTransformer()
#counts = count_vectorizer.fit_transform(train_set["total"].values)
#tfidf = transformer.fit_transform(counts)

In [77]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=10)
classifier = classifier.fit(features, targets)

In [102]:
from sklearn.ensemble import GradientBoostingClassifier
grad_boost_classifier = GradientBoostingClassifier(n_estimators=10)
grad_boost_classifier = grad_boost_classifier.fit(features, targets)

In [110]:
from sklearn.ensemble import ExtraTreesClassifier
extra_trees_classifier = ExtraTreesClassifier(n_estimators=100, n_jobs=-1)
extra_trees_classifier = extra_trees_classifier.fit(features, targets)

In [37]:
from article_utils import rate_articles

In [112]:
articles_url_list = [
    "https://www.huffpost.com/entry/britishisms-in-american-english_n_5b69a9ede4b0b15abaa73cff",
    "https://www.huffingtonpost.com/entry/trump-border-patrol-agent-english_us_5b7b1aefe4b0a5b1febdf13e",
    "https://www.huffingtonpost.com/entry/high-school-teachers-are-using-dystopian-books-to-explore-the-state-of-america-today_us_58fe712ae4b018a9ce5ddd7b",
    "https://www.breitbart.com/news/migrant-caravan-fragments-as-hundreds-return-to-the-road/",
    "https://www.breitbart.com/politics/2018/11/09/incoming-house-dems-same-gun-control-failed-stop-california-massacre/",
    "https://www.breitbart.com/africa/2018/11/09/chinese-pork-company-paying-its-debt-in-ham/",
    "https://www.nytimes.com/2018/11/10/world/middleeast/jamal-khashoggi-murder-turkey-recordings.html",
    "https://www.nytimes.com/2018/11/11/world/middleeast/saudi-iran-assassinations-mohammed-bin-salman.html",
    "https://politics.theonion.com/new-trump-campaign-ad-claims-that-illegal-immigrants-cu-1830183720",
    "https://politics.theonion.com/sarah-huckabee-sanders-denies-doctoring-footage-showing-1830314031",
]


reliability_scores = rate_articles(articles_url_list, extra_trees_classifier, count_vectorizer, transformer)

Classifier:	 ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)


13 Britishisms That Have Invaded American English
By:	 Caroline Bologna & Culture & Parenting Reporter
Score:	 0.72


Trump Praises Border Patrol Agent Who 'Speaks Perfect English’ At Immigration Event
By:	 Nina Golgowski & General Assignment Reporter
Score:	 0.42


How High School Teachers Are Using Dystopian Books To Explore The State Of America Today
By:	 Maddie Crum & Books & Culture Reporter
Score:	 0.66


Migrant caravan fragments as hundreds return to the road
By:	 
Score:	 0.58


Incoming House Democrats Promise Same Gun Control That Failed to Stop California Massac

['House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It'
 'FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart'
 'Why the Truth Might Get You Fired' ...
 'Macy’s Is Said to Receive Takeover Approach by Hudson’s Bay - The New York Times'
 'NATO, Russia To Hold Parallel Exercises In Balkans'
 'What Keeps the F-35 Alive']


# Technical challenge : Imitation of a function using a Neural network


In this challenge you will have to create a Recurrent Neural Networks (RNN) that takes as an input a vector $X=[x_1,x_2,..,x_{M_X}]$ of size $(1,M_X)$ (where $M_X$ can be of any size between 1 and 20 and each component $x_i$ between -10 and 10) and gives as an output the y= $\sum_{i=1}^{M_X}|x_i| $. In other words, the network will have to emulate the l1 norm.

For instance, if I give the vector $X_{example1}=[ -4;-2;3]$ as an input to the network, it has to output $4+2+3 =9$. 

If I give the vector $X_{example2} = [ -4; -2 ; 3 ; 10 ; -5]$ as an input to the network, it has to output $4+2+3+10+5 =24$

For this challenge you can use all library and tools that you want (Keras, tensorflow, pytorch..). Please put your code in this jupyter notebook. I didn't provide a training/testing set because you will have to create it yourself !

Here is a good ressource if you don't know anything about RNN: 

- https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
- https://www.cpuheater.com/deep-learning/introduction-to-recurrent-neural-networks-in-pytorch/
- https://medium.com/dair-ai/building-rnns-is-fun-with-pytorch-and-google-colab-3903ea9a3a79

In [114]:
from dataset import VectDataset, PadCollate
from model import RNNRegressor
import torch.nn as nn
import torch.utils.data as data
from train import train

model = RNNRegressor(input_size=1, hidden_size=30, num_layers=3)
train_dataset = VectDataset(dataset_size=10000)
train_dataloader = data.DataLoader(train_dataset, batch_size=1000, collate_fn=PadCollate(dim=0))
train(model, dataloader=train_dataloader, epochs=1000, criterion=nn.MSELoss())


CUDA
[Epoch 0, Batch 0/10]:  [Loss: 3991.18]


  epoch, ind_batch, len(dataloader), loss.data[0]


[Epoch 1, Batch 0/10]:  [Loss: 3988.92]
[Epoch 2, Batch 0/10]:  [Loss: 3986.72]
[Epoch 3, Batch 0/10]:  [Loss: 3984.53]
[Epoch 4, Batch 0/10]:  [Loss: 3982.09]
[Epoch 5, Batch 0/10]:  [Loss: 3979.28]
[Epoch 6, Batch 0/10]:  [Loss: 3976.14]
[Epoch 7, Batch 0/10]:  [Loss: 3972.91]
[Epoch 8, Batch 0/10]:  [Loss: 3969.53]
[Epoch 9, Batch 0/10]:  [Loss: 3965.93]
[Epoch 10, Batch 0/10]:  [Loss: 3962.12]
[Epoch 11, Batch 0/10]:  [Loss: 3957.95]
[Epoch 12, Batch 0/10]:  [Loss: 3953.15]
[Epoch 13, Batch 0/10]:  [Loss: 3947.44]
[Epoch 14, Batch 0/10]:  [Loss: 3940.04]
[Epoch 15, Batch 0/10]:  [Loss: 3929.98]
[Epoch 16, Batch 0/10]:  [Loss: 3915.66]
[Epoch 17, Batch 0/10]:  [Loss: 3893.30]
[Epoch 18, Batch 0/10]:  [Loss: 3853.66]
[Epoch 19, Batch 0/10]:  [Loss: 3767.65]
[Epoch 20, Batch 0/10]:  [Loss: 3519.99]
[Epoch 21, Batch 0/10]:  [Loss: 2707.51]
[Epoch 22, Batch 0/10]:  [Loss: 824.94]
[Epoch 23, Batch 0/10]:  [Loss: 741.57]
[Epoch 24, Batch 0/10]:  [Loss: 680.49]
[Epoch 25, Batch 0/10]:  [Lo

[Epoch 209, Batch 0/10]:  [Loss: 5.23]
[Epoch 210, Batch 0/10]:  [Loss: 5.19]
[Epoch 211, Batch 0/10]:  [Loss: 5.16]
[Epoch 212, Batch 0/10]:  [Loss: 5.13]
[Epoch 213, Batch 0/10]:  [Loss: 5.09]
[Epoch 214, Batch 0/10]:  [Loss: 5.06]
[Epoch 215, Batch 0/10]:  [Loss: 5.03]
[Epoch 216, Batch 0/10]:  [Loss: 4.99]
[Epoch 217, Batch 0/10]:  [Loss: 4.96]
[Epoch 218, Batch 0/10]:  [Loss: 4.93]
[Epoch 219, Batch 0/10]:  [Loss: 4.90]
[Epoch 220, Batch 0/10]:  [Loss: 4.88]
[Epoch 221, Batch 0/10]:  [Loss: 4.85]
[Epoch 222, Batch 0/10]:  [Loss: 4.82]
[Epoch 223, Batch 0/10]:  [Loss: 4.79]
[Epoch 224, Batch 0/10]:  [Loss: 4.76]
[Epoch 225, Batch 0/10]:  [Loss: 4.74]
[Epoch 226, Batch 0/10]:  [Loss: 4.71]
[Epoch 227, Batch 0/10]:  [Loss: 4.69]
[Epoch 228, Batch 0/10]:  [Loss: 4.66]
[Epoch 229, Batch 0/10]:  [Loss: 4.63]
[Epoch 230, Batch 0/10]:  [Loss: 4.61]
[Epoch 231, Batch 0/10]:  [Loss: 4.58]
[Epoch 232, Batch 0/10]:  [Loss: 4.56]
[Epoch 233, Batch 0/10]:  [Loss: 4.53]
[Epoch 234, Batch 0/10]: 

[Epoch 421, Batch 0/10]:  [Loss: 1.32]
[Epoch 422, Batch 0/10]:  [Loss: 1.32]
[Epoch 423, Batch 0/10]:  [Loss: 1.31]
[Epoch 424, Batch 0/10]:  [Loss: 1.31]
[Epoch 425, Batch 0/10]:  [Loss: 1.30]
[Epoch 426, Batch 0/10]:  [Loss: 1.30]
[Epoch 427, Batch 0/10]:  [Loss: 1.29]
[Epoch 428, Batch 0/10]:  [Loss: 1.29]
[Epoch 429, Batch 0/10]:  [Loss: 1.28]
[Epoch 430, Batch 0/10]:  [Loss: 1.28]
[Epoch 431, Batch 0/10]:  [Loss: 1.27]
[Epoch 432, Batch 0/10]:  [Loss: 1.27]
[Epoch 433, Batch 0/10]:  [Loss: 1.26]
[Epoch 434, Batch 0/10]:  [Loss: 1.26]
[Epoch 435, Batch 0/10]:  [Loss: 1.25]
[Epoch 436, Batch 0/10]:  [Loss: 1.25]
[Epoch 437, Batch 0/10]:  [Loss: 1.24]
[Epoch 438, Batch 0/10]:  [Loss: 1.24]
[Epoch 439, Batch 0/10]:  [Loss: 1.23]
[Epoch 440, Batch 0/10]:  [Loss: 1.23]
[Epoch 441, Batch 0/10]:  [Loss: 1.23]
[Epoch 442, Batch 0/10]:  [Loss: 1.22]
[Epoch 443, Batch 0/10]:  [Loss: 1.22]
[Epoch 444, Batch 0/10]:  [Loss: 1.21]
[Epoch 445, Batch 0/10]:  [Loss: 1.21]
[Epoch 446, Batch 0/10]: 

[Epoch 633, Batch 0/10]:  [Loss: 0.69]
[Epoch 634, Batch 0/10]:  [Loss: 0.68]
[Epoch 635, Batch 0/10]:  [Loss: 0.68]
[Epoch 636, Batch 0/10]:  [Loss: 0.68]
[Epoch 637, Batch 0/10]:  [Loss: 0.68]
[Epoch 638, Batch 0/10]:  [Loss: 0.68]
[Epoch 639, Batch 0/10]:  [Loss: 0.68]
[Epoch 640, Batch 0/10]:  [Loss: 0.67]
[Epoch 641, Batch 0/10]:  [Loss: 0.67]
[Epoch 642, Batch 0/10]:  [Loss: 0.67]
[Epoch 643, Batch 0/10]:  [Loss: 0.67]
[Epoch 644, Batch 0/10]:  [Loss: 0.67]
[Epoch 645, Batch 0/10]:  [Loss: 0.66]
[Epoch 646, Batch 0/10]:  [Loss: 0.66]
[Epoch 647, Batch 0/10]:  [Loss: 0.66]
[Epoch 648, Batch 0/10]:  [Loss: 0.66]
[Epoch 649, Batch 0/10]:  [Loss: 0.66]
[Epoch 650, Batch 0/10]:  [Loss: 0.65]
[Epoch 651, Batch 0/10]:  [Loss: 0.65]
[Epoch 652, Batch 0/10]:  [Loss: 0.65]
[Epoch 653, Batch 0/10]:  [Loss: 0.65]
[Epoch 654, Batch 0/10]:  [Loss: 0.65]
[Epoch 655, Batch 0/10]:  [Loss: 0.64]
[Epoch 656, Batch 0/10]:  [Loss: 0.64]
[Epoch 657, Batch 0/10]:  [Loss: 0.64]
[Epoch 658, Batch 0/10]: 

[Epoch 845, Batch 0/10]:  [Loss: 0.34]
[Epoch 846, Batch 0/10]:  [Loss: 0.34]
[Epoch 847, Batch 0/10]:  [Loss: 0.34]
[Epoch 848, Batch 0/10]:  [Loss: 0.34]
[Epoch 849, Batch 0/10]:  [Loss: 0.34]
[Epoch 850, Batch 0/10]:  [Loss: 0.34]
[Epoch 851, Batch 0/10]:  [Loss: 0.33]
[Epoch 852, Batch 0/10]:  [Loss: 0.33]
[Epoch 853, Batch 0/10]:  [Loss: 0.33]
[Epoch 854, Batch 0/10]:  [Loss: 0.33]
[Epoch 855, Batch 0/10]:  [Loss: 0.33]
[Epoch 856, Batch 0/10]:  [Loss: 0.33]
[Epoch 857, Batch 0/10]:  [Loss: 0.33]
[Epoch 858, Batch 0/10]:  [Loss: 0.33]
[Epoch 859, Batch 0/10]:  [Loss: 0.33]
[Epoch 860, Batch 0/10]:  [Loss: 0.33]
[Epoch 861, Batch 0/10]:  [Loss: 0.33]
[Epoch 862, Batch 0/10]:  [Loss: 0.33]
[Epoch 863, Batch 0/10]:  [Loss: 0.33]
[Epoch 864, Batch 0/10]:  [Loss: 0.33]
[Epoch 865, Batch 0/10]:  [Loss: 0.32]
[Epoch 866, Batch 0/10]:  [Loss: 0.32]
[Epoch 867, Batch 0/10]:  [Loss: 0.32]
[Epoch 868, Batch 0/10]:  [Loss: 0.32]
[Epoch 869, Batch 0/10]:  [Loss: 0.32]
[Epoch 870, Batch 0/10]: 

In [3]:
from test import test

test_dataset = VectDataset(dataset_size=2000)
test_dataloader = data.DataLoader(test_dataset, batch_size=500, collate_fn=PadCollate(dim=0))
test(model, dataloader=test_dataloader, criterion=nn.MSELoss())


Running on cuda
cuda
MSELoss test loss = 0.217


6
