# **Fake News**

<body>
<img src="https://conversationagent.typepad.com/.a/6a00d8341c03bb53ef0147e02d8fa5970b-pi" width="870"/>
</body>

**The Daily Inspector was a popular news agency which was once trusted for its reliable and speedy news coverage with over 4 million readers worldwide.**

**A recent lapse in the news agency’s fact-checking policy meant that over 1,000 articles went out within the last 2 months with incorrect information, leading to heavy criticism and a significant drop in the agency’s number of readers.**

**To rectify this, The Daily Inspector has hired you to find a way of locating these fake articles so that they can be deleted or corrected, and to suggest ways the agency can prevent this from happening again.**


## **1. Import Libraries and Data needed**
<body>
<img src="https://offloadmedia.feverup.com/secretldn.com/wp-content/uploads/2016/06/18075319/Libraries-1024x901.jpg" width="600"/>
</body>

Here we are importing all the libraries that we will need to analyse the data.

Libraries contain functions and tools that other programmers have already written. This way we don't have to spend hours recreating code. Instead, we just call a function by name and it performs all the steps we want it to do.

In [None]:
!pip install contractions

In [None]:
## Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
from nltk.stem import PorterStemmer
import re
import tensorflow
from tensorflow.keras.layers import Embedding, Dense, LSTM
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from bs4 import BeautifulSoup
from tqdm import tqdm
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

In [None]:
## Import train dataset
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
df = pd.read_csv('https://raw.githubusercontent.com/ssonkol/Fake-News-Detection/main/train.csv',engine='python', encoding='utf-8', on_bad_lines='skip')
## Import test dataset
test_data = pd.read_csv('https://raw.githubusercontent.com/ssonkol/Fake-News-Detection/main/test.csv',engine='python', encoding='utf-8', on_bad_lines='skip')

## **2. Exploratory Data Analysis & Text Processing**

First let's inspect the data

Note: 

1 = Fake News/Article

0 = Not Fake News/Article


In [None]:
df.head()

Let's check if there are any null values in the data!

Models can't make predictions, and we can't draw useful insights, when our data contains null values.

In [None]:
df.isnull().sum()

#### **Data Prep**

In [None]:
# Assign nan in place of blanks in the text column
df['text'] = df['text'].str.strip()
df['text'] = df['text'].replace(r'^\s*$', np.nan, regex=True)

In [None]:
# Remove all rows where the text column is nan
df.dropna(subset=['text'], inplace=True)
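
To see what the two cells above do, here is a minimal sketch on a made-up DataFrame. The `toy` frame and its values are invented purely for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one real article, one empty string, one whitespace-only string
toy = pd.DataFrame({
    "text": ["Some real article text ", "", "   "],
    "label": [0, 1, 1],
})

# Strip surrounding whitespace, then turn empty strings into NaN
toy["text"] = toy["text"].str.strip()
toy["text"] = toy["text"].replace(r"^\s*$", np.nan, regex=True)

# Count how many text values are now missing
missing_before = int(toy["text"].isnull().sum())

# Drop the rows whose text is missing
toy = toy.dropna(subset=["text"])

print("Missing text values:", missing_before)
print("Rows left:", len(toy))
```

Blank strings are not counted by `isnull()` until they are converted to NaN, which is why the conversion comes first.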

#### **Check for duplicates**

Having duplicates in any dataset isn't good, and removing them is especially important for building effective and accurate models.

Having data with no duplicates ensures that you will develop one complete version of the truth, allowing you to base strategic decisions on accurate data.

In [None]:
df.duplicated(subset=["text"]).value_counts()

As we can see, there are a few duplicate entries in the text column.

In [None]:
dup = df[df.duplicated(subset=["text"])]
dup.head()

In [None]:
# print one duplicate entry
df[df['text'] == dup.loc[480]['text']]

**Question**

Why would you want to remove these duplicate entries?

#### **Drop duplicated Data & Nan values**

In [None]:
# drop duplicated data, keeping the first copy of each article text
df = df.drop_duplicates(subset=["text"], keep='first')
df.shape
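
As a rough illustration of how `duplicated` and `drop_duplicates` treat repeated text, here is a small sketch on invented data. The `toy` frame, its titles and its texts are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical toy data: the first and third rows share the same text
toy = pd.DataFrame({
    "title": ["A", "B", "A again"],
    "text": ["same story", "different story", "same story"],
})

# duplicated() marks every repeat after the first occurrence as True
dup_mask = toy.duplicated(subset=["text"])

# keep='first' keeps the first copy of each text and drops the later ones
deduped = toy.drop_duplicates(subset=["text"], keep="first")

print(dup_mask.tolist())
print(deduped["title"].tolist())
```

Only the later copy is flagged, so the first version of each article survives the clean-up.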

#### **Replace null values**

In [None]:
# Checking for missing values in the dataset
df.isnull().sum()

In [None]:
# replace the remaining nan values in the other columns with empty strings
df = df.fillna('')

In [None]:
# Now count the Unique values to check the data is balanced or not
count = np.unique(df['label'], return_counts=True)
count

Now let's see that graphically

In [None]:
import seaborn as sns
sns.countplot(x='label', data = df)
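
Besides the raw counts, it can help to look at the share of each class. The sketch below uses a short invented list of labels (not the real dataset) and only the standard library.

```python
from collections import Counter

# Hypothetical labels: 1 = fake, 0 = not fake
labels = [1, 0, 0, 1, 1, 0, 0, 0]

# Count each class, then convert the counts to fractions of the total
counts = Counter(labels)
total = len(labels)
shares = {label: counts[label] / total for label in sorted(counts)}

print(shares)
```

If one class made up the vast majority of rows, accuracy alone would be a misleading measure of the model.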

#### **Extra EDA**

Let's check the 10 authors who created the most articles with fake news and compare them to the top 10 authors who created accurate articles (not fake).

In [None]:
df.loc[df['label'] == 1]['author'].value_counts()[:10].sort_values().plot(kind = 'barh',figsize=(18,6),title="Top 10 Authors Making Fake News/Articles")

In [None]:
df.loc[df['label'] == 0]['author'].value_counts()[:10].sort_values().plot(kind = 'barh',figsize=(18,6),title="Top 10 Authors Making Factually Correct Articles")

### **Preprocessing Text**

We will perform the below preprocessing tasks:

- Convert everything to lowercase - We don't want our model to think having capitals and non capitals in text are significant indications of an article being fake or not... So let's set everything to lower case.
- Remove HTML tags
- Remove URLs from sentences
- Contraction mapping - This essentially fixes and expands shortened words - you're -> you are
- Eliminate punctuation and special characters
- Remove stopwords - words that are usually irrelevant to the meaning of a piece of text
- Stem words in text - This essentially removes the suffix of a word and gives back the base/root of a word (e.g. Flying becomes Fly)
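
Several of the steps above can be sketched on a single sentence using only the standard library. This is a simplified stand-in, not the notebook's pipeline: `toy_clean` and `TOY_STOPWORDS` are hypothetical names, the HTML removal uses a plain regex instead of BeautifulSoup, and contraction mapping and stemming are left out.

```python
import re

# Hypothetical mini stopword list, far shorter than NLTK's English list
TOY_STOPWORDS = {"the", "is", "at", "a", "for", "it", "this"}

def toy_clean(sentence):
    sentence = sentence.lower()                    # convert to lowercase
    sentence = re.sub(r"http\S+", "", sentence)    # remove URLs
    sentence = re.sub(r"<[^>]+>", " ", sentence)   # remove HTML tags (simple regex stand-in)
    sentence = re.sub(r"[^a-z]+", " ", sentence)   # remove punctuation, digits, special characters
    words = [w for w in sentence.split() if w not in TOY_STOPWORDS]  # remove stopwords
    return " ".join(words)

print(toy_clean("<p>The Story is BREAKING at https://example.com NOW!</p>"))
```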

In [None]:
from nltk.corpus import stopwords

In [None]:
nltk.download('stopwords')
nltk.download('punkt')
# newer NLTK releases look for 'punkt_tab' when calling word_tokenize
nltk.download('punkt_tab')

In [None]:
stopword_list = stopwords.words('english')
print(stopword_list)

In [None]:
# let us use the contractions package to fix and expand shortened words - you're -> you are
import contractions
def decontracted(sentence):
    expanded_words = []
    for word in sentence.split():
        expanded_words.append(contractions.fix(word))

    expanded_text = ' '.join(expanded_words)
    return expanded_text

In [None]:
def sentence_clean(sentence):
    # change sentence to lower case
    sentence = sentence.lower()
    # removing URL from sentence
    sentence = re.sub(r"http\S+", "", sentence)
    # removing HTML tags
    sentence = BeautifulSoup(sentence, 'lxml').get_text()
    # expanding contractions in the sentence (calls the decontracted function)
    sentence = decontracted(sentence)
    # removing words that contain digits
    sentence = re.sub(r"\S*\d\S*", "", sentence).strip()
    # removing special character
    sentence = re.sub('[^A-Za-z]+', ' ', sentence)
    
    return sentence

In [None]:
# Use Stemming 
ps = PorterStemmer()

# Performing the preprocessing steps on all messages
def preprocess(document):
    preprocessed_reviews = []
    # tqdm is for printing the status bar
    for sentence in tqdm(document):
        # call sentence_clean function to clean text
        sentence = sentence_clean(sentence)
        # tokenize into words
        words = word_tokenize(sentence)
        # remove stop words
        tokens = [ps.stem(word) for word in words if word not in stopword_list]

        # join words to make sentence
        sentence = " ".join(tokens).strip()

        preprocessed_reviews.append(sentence)
        
    return preprocessed_reviews

In [None]:
%%time
corpus = preprocess(df['text'])

#### **Try to spot the differences in the text!**

What has stemming the words done?

In [None]:
# use .iloc so the row position matches the position in corpus
print("Before preprocess\n", df['text'].iloc[1])
print("***"*40)
print("After preprocess\n", corpus[1])

In [None]:
df['text'] = corpus

## **3. Build Our Training And Test Data**
<body>
<img src="https://clearmeasure.com/wp-content/uploads/2018/11/build-1159776_960_720.jpg"/>
</body>

**In this section we are separating our data into X and y.**

- X will be the data the model receives and uses to make a prediction.

- y will be the data the model will compare its prediction to, i.e. the data the model marks itself against.

**Then the data will be split into training data and test data.**

Depending on how much data you have (the more the better), we split it into training and test data, typically with a training-to-test ratio somewhere between 70/30 and 80/20.

We want to give our model plenty of data to learn from but we also need to give it enough unseen tests for it to accurately gauge its predictive ability.



<details>
  <summary>Parameters you can change:</summary>

  - test_size - As stated before, this parameter will determine your training data to test data ratio.

    -  0.5 means that we are creating a 50/50 split between training and test data
    - 0.25 means that we are creating a 75/25 split between training and test data
</details>
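
To make the effect of `test_size` concrete, here is a rough standard-library sketch of a shuffled split. `toy_train_test_split` is a hypothetical helper written for this illustration. It mimics the idea behind `train_test_split` but is not scikit-learn's implementation, and it may round differently.

```python
import random

def toy_train_test_split(items, test_size, seed=0):
    """Shuffle a copy of items and split off a test_size fraction as test data."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-ins for 100 articles
articles = [f"article {i}" for i in range(100)]

for test_size in (0.5, 0.25, 0.2):
    train, test = toy_train_test_split(articles, test_size)
    print("test_size:", test_size, "train:", len(train), "test:", len(test))
```

A larger `test_size` gives a more reliable test score but leaves the model with less data to learn from.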

In [None]:
# Separating the data and the label
X = df['text']
y = df['label']

In [None]:
# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

In [None]:
print("Train size:", X_train.shape)
print("Test size:", X_test.shape)

## **4. Create our Model**
<body>
<img src="https://scx2.b-cdn.net/gfx/news/hires/2019/howtoovercom.jpg" width="870"/>
</body>

For our fake news detection model, we will be using a logistic regression model.

In simple terms, logistic regression is a statistical model used to estimate the probability of an event occurring, e.g. the probability that you will win or lose (in this example, that an article is real or fake).
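
As a minimal sketch of the idea, logistic regression turns a score (a weighted sum of the features) into a probability with the sigmoid function, then compares it with a threshold. The scores below are invented, and `sigmoid` and `toy_predict` are hypothetical helpers for illustration. In the notebook, `LogisticRegression` learns the weights behind the score from the training data.

```python
import math

def sigmoid(score):
    """Squash any real-valued score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))

def toy_predict(score, threshold=0.5):
    """Return (label, probability), where label 1 = fake and 0 = not fake."""
    probability_fake = sigmoid(score)
    label = 1 if probability_fake >= threshold else 0
    return label, probability_fake

# Hypothetical scores: negative leans "not fake", positive leans "fake"
for score in (-2.0, 0.0, 2.0):
    label, probability = toy_predict(score)
    print("score:", score, "label:", label, "probability of fake:", round(probability, 3))
```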

In [None]:
def plot_confusion_matrix(y_actual, y_pred):
    '''
    This method plots the confusion matrix.
    '''
    # confusion_matrix orders classes by label value: 0 = real news, 1 = fake news
    classes = ['Real News', 'Fake News']
    tick_marks = np.arange(len(classes))

    accuracy = accuracy_score(y_actual, y_pred)
    print("Accuracy score:", "{:2.3}".format(accuracy))

    conf_matrix = confusion_matrix(y_actual, y_pred)

    fig, ax = plt.subplots(figsize=(4, 4))
    ax.matshow(conf_matrix, cmap=plt.cm.Reds, alpha=0.3)
    for i in range(conf_matrix.shape[0]):
        for j in range(conf_matrix.shape[1]):
            ax.text(x=j, y=i,s=conf_matrix[i, j], va='center', ha='center')
    
    plt.tight_layout()
    plt.xticks(tick_marks , classes, rotation=0)
    plt.yticks(tick_marks , classes)
    plt.xlabel('Predictions')
    plt.ylabel('Actuals')
    plt.title('Confusion Matrix', fontsize=12)
    plt.show()

### **TF-IDF Vectorizing**
<body>
<img src="https://miro.medium.com/max/1400/1*qQgnyPLDIkUmeZKN2_ZWbQ.png" width="870"/>
</body>


As the image above suggests, **TF-IDF** (**Term Frequency-Inverse Document Frequency**) is a numerical statistic reflecting how important a word is to a document in a collection. We can use this to pick out the words that appear to be important in each article. These will be used as our features - words that the model will use to learn if an article is fake or not.


**Don't worry you don't have to know any formulas for this!!**
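
For the curious, the idea can still be sketched in a few lines. The example below uses the textbook formula (term frequency times the log of the inverse document frequency) on three invented headlines. It is an illustration only: `TfidfVectorizer` uses a smoothed variant and normalises each document vector, so its numbers will differ.

```python
import math

# Hypothetical mini collection of documents
docs = [
    "aliens land in london",
    "election results in london",
    "aliens win election",
]
tokenized = [doc.split() for doc in docs]

def tf(term, tokens):
    """Term frequency: share of the document's words that are this term."""
    return tokens.count(term) / len(tokens)

def idf(term, all_docs):
    """Inverse document frequency: rarer terms across the collection score higher."""
    containing = sum(1 for tokens in all_docs if term in tokens)
    return math.log(len(all_docs) / containing)

def tfidf_score(term, tokens, all_docs):
    return tf(term, tokens) * idf(term, all_docs)

# Score every word of the first document
first = tokenized[0]
for term in first:
    print(term, round(tfidf_score(term, first, tokenized), 3))
```

The word "land" appears in only one document, so it gets the highest score in the first headline, while words shared with other documents score lower.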


<details>
  <summary>Parameters you can change:</summary>

  - ngram_range - This parameter sets the range of term lengths, from 1 to n (1,n), that are extracted as features.

    -  (1,1) means that only one-word terms will be extracted and used as features.
    -  (1,2) means that one-word and two-word terms will be extracted and used as features.
    -  (1,3) means that one-word, two-word and three-word terms will be extracted and used as features.

**Example**

  Text = "I am writing this text as an example to show you the importance of tf-idf vectorizers"
  1. ngram_range=(1,1) might pick out terms like ["writing","example"...] etc

  2. ngram_range=(1,2) might pick out terms like ["writing", "writing this","example"...] etc

  3. ngram_range=(1,3) might pick out terms like ["importance", "importance of", "importance of tf-idf"...] etc
</details>
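
The n-gram ranges described above can be sketched with a small helper. `toy_ngrams` is a hypothetical function written for this illustration. It splits on whitespace only, whereas `TfidfVectorizer` also applies its own tokenisation and stop word removal.

```python
def toy_ngrams(text, ngram_range):
    """Return all word n-grams whose length falls within ngram_range (inclusive)."""
    words = text.lower().split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams

sample = "fake news spreads fast"
print(toy_ngrams(sample, (1, 1)))
print(toy_ngrams(sample, (1, 2)))
print(toy_ngrams(sample, (1, 3)))
```

Wider ranges capture short phrases as well as single words, at the cost of many more features.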

In [None]:
# try out different ngram_range values
tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1,3))
tfidf.fit(X_train) # fit the vectorizer on the training data only, so the test data stays unseen

#Now convert the data into their tfidf representations
X_train_cv = tfidf.transform(X_train)
X_test_cv = tfidf.transform(X_test)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
len(tfidf.get_feature_names_out())

In [None]:
# Create our logistic regression object/model object and fit it to our training data
model = LogisticRegression()
model.fit(X_train_cv, y_train)

In [None]:
#predict on training data
X_train_predict = model.predict(X_train_cv)
#Let's see our training accuracy
train_accuracy = accuracy_score(y_train, X_train_predict)

#predict on our test data
X_test_predict = model.predict(X_test_cv)
#Let's see our test accuracy
test_accuracy = accuracy_score(y_test, X_test_predict)

In [None]:
accuracy = accuracy_score(y_test, X_test_predict)
LR_TF_TFIDF = {'Vectorizer': 'TF-IDF', 'Algorithm': 'Logistic_Regression_1', 
               'Train Accuracy':train_accuracy, 'Test Accuracy':test_accuracy}

In [None]:
LR_TF_TFIDF

A confusion matrix shows exactly how our model performed.
It identifies where the model produced false positives and false negatives compared to the correct answers.
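
The four cells of the matrix can be counted by hand. The sketch below uses invented actual and predicted labels (not the model's real output) to show how true/false positives and negatives relate to accuracy.

```python
# Hypothetical labels: 1 = fake, 0 = not fake
actual_labels    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted_labels = [1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(actual_labels, predicted_labels))
true_pos  = sum(1 for a, p in pairs if a == 1 and p == 1)  # fake, flagged as fake
false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)  # fake, but missed
false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)  # real, wrongly flagged
true_neg  = sum(1 for a, p in pairs if a == 0 and p == 0)  # real, left alone

# Accuracy is the share of all predictions that were correct
toy_accuracy = (true_pos + true_neg) / len(pairs)

print("TP:", true_pos, "FN:", false_neg, "FP:", false_pos, "TN:", true_neg)
print("Accuracy:", toy_accuracy)
```

For a news agency, a false negative (a fake article left online) and a false positive (a correct article flagged) have different costs, which accuracy alone does not show.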

In [None]:
# plot confusion matrix on test
plot_confusion_matrix(y_test, X_test_predict)

## **5. Test our model**

Let us test our model with our own examples and some examples from the training and test data!

In [None]:
def classify_message(text):
    text = tfidf.transform(text)
    predicted = model.predict(text)
    probability = model.predict_proba(text).max()*100

    if predicted==0:
      print(" I am "+ str(round(probability))+"% sure that this is not Fake news")
    else:
      print(" I am "+ str(round(probability))+"% sure that this news is Fake")

In [None]:
test2 = ["Share a certain post of Bill Gates on Facebook and he will send you money."]

In [None]:
classify_message(test2)

In [None]:
df[['title','text','label']]

In [None]:
classify_message([df['text'][1]])

### **Apply it to our test data**


In [None]:
def classify_message(text):
    text = tfidf.transform(text)
    predicted = model.predict(text)
    probability = model.predict_proba(text).max()*100

    if predicted==0:
      return("Not Fake")
    else:
      return("Fake")

In [None]:
def prediciton_score(text):
    text = tfidf.transform(text)
    predicted = model.predict(text)
    return predicted

In [None]:
#create a new column called prediction with no data
test_data['prediction'] = " "
#create a new column called prediction_score with no data
test_data['prediction_score'] = " "

In [None]:
test_data = test_data.fillna('')

In [None]:
#create a for loop which takes in data from the text column and returns a prediction as to whether the news is fake or not
#.loc avoids chained assignment, which pandas may silently apply to a copy
for i in range(len(test_data)):
  test_data.loc[i, 'prediction'] = classify_message([test_data['text'][i]])
  test_data.loc[i, 'prediction_score'] = prediciton_score([test_data['text'][i]])[0]
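
As an alternative to looping row by row, the whole text column can be transformed and predicted in one call. The sketch below shows one possible way to do that on a tiny invented dataset. `toy_vectorizer`, `toy_model`, `toy_test` and all texts are assumptions for illustration, and here the score column holds the predicted probability of the fake class rather than the predicted label.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data: 1 = fake, 0 = not fake
toy_texts = ["share post win money", "council approves budget",
             "click here free money", "court publishes ruling"]
toy_labels = [1, 0, 1, 0]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

# Hypothetical unseen articles, including one empty text
toy_test = pd.DataFrame({"text": ["free money post", "budget ruling", ""]})

# Transform and predict the whole column at once instead of one row at a time
features = toy_vectorizer.transform(toy_test["text"])
predicted_labels = toy_model.predict(features)
toy_test["prediction"] = ["Fake" if label == 1 else "Not Fake" for label in predicted_labels]

# Column 1 of predict_proba is the probability of class 1 (fake)
toy_test["prediction_score"] = toy_model.predict_proba(features)[:, 1]

print(toy_test)
```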

In [None]:
test_data