<center><h1> Fake News! </h1></center>

In this assignment, you are going to prototype a fake news detector application. The attached dataset contains headlines from online resources along with a Label indicating whether the headline represents Fake News (0) or Real News (1). Your task is to train an ML model to detect Fake News based on the text included in the headline.

Go ahead and import the dataset:

In [None]:
import pandas as pd
data = pd.read_csv("https://raw.githubusercontent.com/yildirimcaglar/yildirimcaglar.github.io/master/ds3000/fake_news_data.csv")
data

Unnamed: 0,Headline,Label
0,Says the Annies List political group supports ...,0
1,Health care reform legislation is likely to ma...,0
2,The Chicago Bears have had more starting quart...,1
3,When Mitt Romney was governor of Massachusetts...,0
4,McCain opposed a requirement that the governme...,1
...,...,...
4549,Says Barack Obama promised to halve the defici...,1
4550,I am the only senator who turned down the stat...,1
4551,There is no system to vet refugees from the Mi...,0
4552,I think its seven or eight of the California s...,0


In [None]:
# here are the target value counts
data["Label"].value_counts()

0    2501
1    2053
Name: Label, dtype: int64

### Question 1 . 

As you can see, the dataset is not perfectly balanced. For a binary classification problem like this, it's better to have a roughly balanced dataset. Therefore, we will downsample both classes and use 2050 headlines from each.

Write a function to randomly sample an equal number of true and false headlines from the data dataframe. Your function will be generic and should work with any dataframe as described and illustrated below:

- The function should receive the dataframe, name of the grouping column, and the number of samples to be drawn
- The function should return a dataframe containing an equal number (n) of each unique value contained in the grouping column (column) randomly selected from the original dataframe (df). 
- Refer to the sample function call.

Hint: You'll need to use the sample method of the dataframe object.

In [None]:
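def sample_df_equally_by_group(df, column, n):
    sample_dfs = []
    # one random sample of n rows per unique value in the grouping column
    for value in df[column].unique():
        sample_dfs.append(df[df[column] == value].sample(n=n))
    # stack the per-group samples into a single dataframe
    return pd.concat(sample_dfs)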

In [None]:
final_data = sample_df_equally_by_group(df=data, column="Label", n=2050)

In [None]:
final_data

Unnamed: 0,Headline,Label
2778,We can fix our roads without raising taxes.,0
2861,The media widely overlooked comments made by f...,0
4191,Says If you compare the Portland Metro area to...,0
1922,"Obama ""shunned the opportunity to talk to sold...",0
560,Most tips left at Dunkin Donuts dont go to emp...,0
...,...,...
2931,"While introducing Donald Trump, former New Yor...",1
534,You've heard endlessly about waterboarding. It...,1
3567,Three out of the 18 benchmarks of the (GAO) ha...,1
2765,The Walton family of Walmart ... This one fami...,1


Here are the final counts in the sampled dataset:

In [None]:
final_data["Label"].value_counts()

0    2050
1    2050
Name: Label, dtype: int64
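The equal-sized sampling in this question can also be sketched with pandas' grouped `sample` method (available in pandas 1.1 and later). The toy dataframe and the `random_state` value below are assumptions made for illustration; fixing `random_state` makes the draw reproducible, which the loop-based solution does not do.

```python
import pandas as pd

# Hypothetical toy data: 6 rows of class 0 and 4 rows of class 1
toy = pd.DataFrame({
    "Headline": [f"headline {i}" for i in range(10)],
    "Label": [0] * 6 + [1] * 4,
})

# Draw 3 rows from each Label group; random_state fixes the draw
balanced = toy.groupby("Label").sample(n=3, random_state=3000)
counts = balanced["Label"].value_counts().to_dict()
print(counts)
```

Each group must contain at least `n` rows, otherwise sampling without replacement raises an error.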

### Question 2 .

Before analyzing the data, you will produce a word cloud for the true and false headlines. A word cloud is a nice way to visualize the frequent words appearing in a piece of text. 
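As a rough illustration of what a word cloud encodes, the sketch below counts word frequencies in a few made-up headlines using only the standard library. The headlines are hypothetical, and the whitespace tokenization is a simplification of what a word cloud library does internally.

```python
from collections import Counter

# Hypothetical headlines, not taken from the dataset
headlines = [
    "Taxes will rise next year",
    "Taxes will fall next year",
    "Senate votes on taxes",
]

# Join all headlines into one string, lowercase it, split on whitespace
words = " ".join(headlines).lower().split()

# The most frequent words are the ones a word cloud would draw largest
top_words = Counter(words).most_common(3)
print(top_words)
```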

For this visualization, you're going to use the StyleCloud library:
 - https://github.com/minimaxir/stylecloud

You'll first need to install the library by referring to the documentation.

Study the documentation carefully. The first sample shows you how to produce a word cloud from a text file:
 - https://github.com/minimaxir/stylecloud#usage
 
Instead of specifying a file, you can also specify the text directly. For this purpose, you'll need to use the **text** keyword argument and specify the text that you want to visualize. 

By default, the word cloud is saved in the same directory as your Notebook file. Once you've executed your code, check that folder. The default file name is "stylecloud.png". You can specify the output name using the output_name keyword argument.


For this question, produce one word cloud for all true headlines (named "vis_true_headlines.png") and another for all false headlines (named "vis_false_headlines.png") contained in the final_data dataframe. The names of the files must be specified in your code.

In [None]:
import stylecloud

stylecloud.gen_stylecloud(text=' '.join(list(final_data[final_data['Label'] == 1]['Headline'])), max_words=100, output_name="vis_true_headlines.png")

stylecloud.gen_stylecloud(text=' '.join(list(final_data[final_data['Label'] == 0]['Headline'])), max_words=100, output_name="vis_false_headlines.png")

True headlines:
<img src="https://i.ibb.co/mXT08Lh/vis-true-headlines.png" alt="vis_true_headlines" border="0">

False headlines:
<img src="https://i.ibb.co/7RnrLcb/vis-false-headlines.png" alt="vis_false_headlines" border="0">

### Question 3 .
Write a function to get the features and target variables from the final_data dataframe and obtain your training and test splits. The function should receive the dataframe and the names of the feature and target columns. Then it should return the splits as shown in the sample output. Use random_state=3000 when splitting your data.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
def split_data(df, feature_column, target_column):
    return train_test_split(df[feature_column], df[target_column], random_state=3000)

In [None]:
X_train, X_test, y_train, y_test = split_data(df=final_data, 
                                              feature_column="Headline", 
                                              target_column="Label")

In [None]:
X_train.shape

(3075,)

In [None]:
X_test.shape

(1025,)
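The 3075/1025 shapes follow from the default behavior of `train_test_split`, which holds out 25% of the rows when no test size is given. A minimal sketch on made-up data (the `items` and `labels` lists are assumptions for illustration) shows the default ratio and the effect of fixing `random_state`:

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 items with alternating binary labels
items = list(range(100))
labels = [i % 2 for i in items]

# Default split: 75% training, 25% testing
X_tr, X_te, y_tr, y_te = train_test_split(items, labels, random_state=3000)

# Same random_state, same shuffle: the split is reproducible
X_tr2, X_te2, _, _ = train_test_split(items, labels, random_state=3000)

print(len(X_tr), len(X_te))
print(X_te == X_te2)
```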

### Question 4 .

Write a function that can be used to vectorize text using the bag-of-words approach. 

- The function should receive the training set and testing set features as arguments.
- The third argument should be the vectorizer, with two possible values: 'count' for CountVectorizer and 'tfidf' for TfidfVectorizer. The default vectorizer should be tfidf.
- The function should construct the vocabulary based on the training set, which should then be used to represent both the training and testing sets. The vectorized training and testing sets should be returned as a tuple at the end.
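To make the last requirement concrete, here is a small sketch on made-up documents showing that a vocabulary fitted on the training set is reused for the test set, so words that appear only in the test set are ignored. The toy documents are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents; "sharply" never appears in the training set
train_docs = ["taxes will rise", "taxes will fall"]
test_docs = ["taxes rise sharply"]

# Learn the vocabulary from the training documents only
vect = CountVectorizer().fit(train_docs)
vocab = sorted(vect.vocabulary_)

# Both sets are encoded with the same columns, one per vocabulary word
train_counts = vect.transform(train_docs).toarray()
test_counts = vect.transform(test_docs).toarray()

print(vocab)
print(train_counts.shape, test_counts.shape)
```

Because both matrices share the same columns, a model trained on the first can score the second directly.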

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
def text_vectorizer(train_set, test_set, vectorizer='tfidf'):
    # build the vocabulary from the training set only
    if vectorizer == 'count':
        vect = CountVectorizer().fit(train_set)
    else:
        vect = TfidfVectorizer().fit(train_set)
    # encode both sets with the training vocabulary
    train_vect = vect.transform(train_set)
    test_vect = vect.transform(test_set)
    return (train_vect, test_vect)

In [None]:
X_train_vectorized, X_test_vectorized = text_vectorizer(train_set=X_train, 
                                                        test_set=X_test, 
                                                        vectorizer = "tfidf")

In [None]:
X_train_vectorized.toarray().shape

(3075, 6867)

In [None]:
X_test_vectorized.toarray().shape

(1025, 6867)

### Question 5 .

Write a code snippet to apply LogisticRegression, MultinomialNB, and DecisionTreeClassifier algorithms to the vectorized data. The model performance must be evaluated on the testing set. Your code must use an iteration statement to apply and evaluate multiple algorithms. Refer to the sample output.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

In [None]:
classifiers = {'Logistic regression': LogisticRegression(), 
               'Multinomial naive bayes': MultinomialNB(), 
               'Decision tree': DecisionTreeClassifier()}

for classifier_name, classifier in classifiers.items():
    model = classifier.fit(X=X_train_vectorized, y=y_train)
    print(classifier_name)
    print("\tClassification accuracy on training set: ", model.score(X_train_vectorized, y_train))
    print("\tClassification accuracy on testing set: ", model.score(X_test_vectorized, y_test))
    print('\n')

Logistic regression
	Classification accuracy on training set:  0.8653658536585366
	Classification accuracy on testing set:  0.624390243902439


Multinomial naive bayes
	Classification accuracy on training set:  0.8878048780487805
	Classification accuracy on testing set:  0.6039024390243902


Decision tree
	Classification accuracy on training set:  1.0
	Classification accuracy on testing set:  0.551219512195122




### Question 6 .

Based on the quick results from the previous question, Logistic Regression seems to be the best choice for this problem. For this question, you will train a Logistic Regression model again, but this time you'll modify some of the parameters when extracting the features. More specifically, you should include a word in your vocabulary only if it has a minimum document frequency of 5. You should also extract ngrams up to bigrams (which includes both unigrams and bigrams). Finally, you should eliminate English stop words from your vocabulary.

Refer to the sample output showing model performance after modifying these parameters.

In [None]:
# create the vocabulary based on the training data
vect = TfidfVectorizer(min_df=5, ngram_range=(1, 2), stop_words='english').fit(X_train)

# encode the words in X_train and X_test based on the vocabulary
X_train_vectorized = vect.transform(X_train)
X_test_vectorized = vect.transform(X_test)

# train the classifier
model = LogisticRegression().fit(X=X_train_vectorized, y=y_train)

print("Classification accuracy on training set: ", model.score(X_train_vectorized, y_train))
print("Classification accuracy on testing set: ", model.score(X_test_vectorized, y_test))


Classification accuracy on training set:  0.7947967479674797
Classification accuracy on testing set:  0.5863414634146341
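To see what the three vectorizer parameters in this question do, the sketch below applies them to a made-up three-document corpus. The documents are assumptions, and `min_df` is lowered to 2 so that something survives in such a small corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus, not taken from the dataset
docs = [
    "the senate passed the budget",
    "the senate rejected the budget",
    "the senate passed the law",
]

# stop_words drops "the"; ngram_range adds bigrams built from the
# remaining words; min_df=2 keeps terms found in at least 2 documents
toy_vect = TfidfVectorizer(min_df=2, ngram_range=(1, 2), stop_words="english").fit(docs)
features = sorted(toy_vect.vocabulary_)
print(features)
```

Terms that occur in only one document, such as "rejected" or "law", are filtered out by `min_df`, while the bigram "senate passed" is kept because it occurs in two documents.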


### Question 7 .

In this last question, you will write a function that can take a headline as a string and return whether it's Real or Fake News. The function should also return the probability associated with the decision, as a measure of the confidence in the prediction.

The prediction must be based on the Logistic Regression model trained in the previous question.

Refer to the sample function calls.

In [None]:
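def headline_checker(headline):
    # accept a single string (as the question asks) or a one-item list
    if isinstance(headline, str):
        headline = [headline]
    features = vect.transform(headline)
    pred = model.predict(features)[0]
    news = 'Real News' if pred == 1 else 'Fake News'
    probs = model.predict_proba(features)[0]
    # probability of the predicted class (columns follow model.classes_)
    prob = probs[1] if pred == 1 else probs[0]
    print(f'Model classification: {news}')
    print(f'Probability: {prob:.2f}')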

In [None]:
headline_checker(["The State adds new vaccine requirement for senate members"])

Model classification: Real News
Probability: 0.70


In [None]:
headline_checker(["Wisconsin Governer says he will never campaign again"])

Model classification: Fake News
Probability: 0.83
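The question asks for a function that returns the decision and its probability, while the solution above prints them. A minimal sketch of a returning variant follows, trained on a made-up four-headline corpus so that it runs on its own. The names `check_headline`, `toy_vect`, and `toy_model` are hypothetical, and the toy data is far too small to give meaningful predictions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: 1 = Real News, 0 = Fake News
toy_headlines = [
    "taxes will rise next year",
    "aliens built the senate",
    "budget passed by senate",
    "moon made of cheese says governor",
]
toy_labels = [1, 0, 1, 0]

toy_vect = TfidfVectorizer().fit(toy_headlines)
toy_model = LogisticRegression().fit(toy_vect.transform(toy_headlines), toy_labels)

def check_headline(headline, vect, model):
    """Return (label name, probability of the predicted class)."""
    features = vect.transform([headline])       # transform expects a list of strings
    probs = model.predict_proba(features)[0]    # one probability per class
    best = probs.argmax()                       # index of the most likely class
    label = model.classes_[best]                # map the column index back to 0 or 1
    name = "Real News" if label == 1 else "Fake News"
    return name, float(probs[best])

name, prob = check_headline("senate votes on budget", toy_vect, toy_model)
print(name, round(prob, 2))
```

Returning the values instead of printing them lets the caller reuse the result, for example to apply a confidence threshold.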
