In [None]:
__author__ = "Pierre-Habté Nouvellon"
__email__ = "pierrehabte@snipfeed.co"
__status__ = "Challenge NLP"

# Intro:

One of the big challenge at Snipfeed is the NLP part. In fact, we have to develop a system that is capable of: 
- (i) recommending articles/videos to people based on their interaction
- (ii) giving context to an article (that is give content related to an article to allow the user to go deeper into a subject)
- (iii) giving the most relevant content following query (search engine).
- (iv) avoiding fake news
- (v) extracting the most important information from an article to transform it into a snip


The understanding and implementation of classic machine learning / deeplearning techniques is a requirement. This test is made to see the candidate's ability to learn new technics and apply it on a concrete examples.

# Fake news challenge

In this task you will have to build a ML model that gives a confidence score between 1 and 0 to an article (is this article reliable or not). The requirements are:

- train you model on the Kaggle fake news dataset https://www.kaggle.com/c/fake-news/data .
- provide the metrics to assess you model on the testing set (accuracy etc)
- build a final function give_score(link) that takes a link of an article as an input and outputs the confidence score. 
- compare 3 different ML technics. Parameter tuning is very important, so explain each step of the process in the notebook.

In [1]:
import pandas as pd
import numpy as np

train_set = pd.read_csv("/data/train.csv", index_col=0)
train_set = train_set.fillna(" ")

train_set.head()

Unnamed: 0_level_0,title,author,text,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [3]:
train_set["total"] = train_set["title"] + " " + train_set["author"] + train_set["text"]

In [4]:
"""Preping the data"""
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import cross_val_score

In [6]:
count_vectorizer = CountVectorizer(ngram_range=(1,2))
transformer = TfidfTransformer()
counts = count_vectorizer.fit_transform(train_set["total"].values)
tfidf = transformer.fit_transform(counts)

In [7]:
targets = train_set["label"].values
features = tfidf

In [11]:
from sklearn.ensemble import RandomForestClassifier
random_forest_classifier = RandomForestClassifier(n_estimators=10, n_jobs=-1)
#random_forest_classifier.fit(features, targets)
accuracies = cross_val_score(estimator=random_forest_classifier, X=features, y=targets, cv=10)
mean_accuracy = accuracies.mean()
print(f"Cross validation mean accuracy: {mean_accuracy}")

Cross validation mean accuracy: 0.8577894118980136


In [12]:
from sklearn.ensemble import AdaBoostClassifier
ada_boost_classifier = AdaBoostClassifier()
accuracies = cross_val_score(estimator=ada_boost_classifier, X=features, y=targets, cv=10)
mean_accuracy = accuracies.mean()
print(f"Cross validation mean accuracy: {mean_accuracy}")

Cross validation mean accuracy: 0.9728357258408916


In [15]:
from sklearn.ensemble import GradientBoostingClassifier
grad_boost_classifier = GradientBoostingClassifier(n_estimators=10)
accuracies = cross_val_score(estimator=grad_boost_classifier, X=features, y=targets, cv=10)
mean_accuracy = accuracies.mean()
print(f"Cross validation mean accuracy: {mean_accuracy}")

Cross validation mean accuracy: 0.943028376030238


# Technical challenge : Imitation of a function using a Neural network


In this challenge you will have to create a Recurrent Neural Networks (RNN) that takes as an input a vector $X=[x_1,x_2,..,x_{M_X}]$ of size $(1,M_X)$ (where $M_X$ can be of any size between 1 and 20 and each component $x_i$ between -10 and 10) and gives as an output the y= $\sum_{i=1}^{M_X}|x_i| $. In other words, the network will have to emulate the l1 norm.

For instance, if I give the vector $X_{example1}=[ -4;-2;3]$ as an input to the network, it has to output $4+2+3 =9$. 

If I give the vector $X_{example2} = [ -4; -2 ; 3 ; 10 ; -5]$ as an input to the network, it has to output $4+2+3+10+5 =24$

For this challenge you can use all library and tools that you want (Keras, tensorflow, pytorch..). Please put your code in this jupyter notebook. I didn't provide a training/testing set because you will have to create it yourself !

Here is a good ressource if you don't know anything about RNN: 

- https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
- https://www.cpuheater.com/deep-learning/introduction-to-recurrent-neural-networks-in-pytorch/
- https://medium.com/dair-ai/building-rnns-is-fun-with-pytorch-and-google-colab-3903ea9a3a79

In [None]:
from dataset import VectDataset, PadCollate
from model import RNNRegressor




In [None]:
model = RNNRegressor(imput_size=1, hidden_size=20, num_layers=1)
train_dataset = VecDataset(dataset_size=10000)
train_dataloader = data.DataLoader(train_dataset, batch_size=1000, collate_fn=PadCollate(dim=0))
train(model, dataloader=train_dataloader, epochs=1000, criterion=nn.MSELoss())

In [None]:
print("6")