## Fake News Detection

#### Fake news refers to misinformation or disinformation spread by word of mouth, traditional media, or digital channels, such as edited videos and pictures, memes, unverified sources, social media rumours, and deepfake videos.
Fake news has become a serious problem. It can lead to mob violence, public shaming, or social stigma, and it can serve as propaganda, for example to favour politicians in elections or to spread inflammatory messages related to terrorism and crime.

There are many examples, both past and present. The dataset for this problem statement is taken from Kaggle: https://www.kaggle.com/pnkjgpt/fake-news-dataset

### Data science project Documentation

The Data Science Method

<b>1.Problem Identification</b>

<b>2.Data Wrangling</b>
* Data Collection
    - Locating the data
    - Data loading
    - Data joining
* Data Organization
    - File structure
    - Git & Github
* Data Definition
    - Column names
    - Data types (numeric, categorical, timestamp, etc.)
    - Description of the columns
    - Count or percent per unique values or codes (including NA)
    - The range of values or codes
* Data Cleaning
    - NA or missing data
    - Duplicates

<b>3.Exploratory Data Analysis</b>

- Build data profile tables and plots
- Outliers & Anomalies
- Explore data relationships
- Identification and creation of features

<b>4.Pre-processing and Training Data Development</b>

- Create dummy or indicator features for categorical variables
- Standardize the magnitude of numeric features
- Split into testing and training datasets
- Apply scaler to the testing set

<b>5.Modeling</b>

- Fit Models with Training Data Set
- Review Model Outcomes — Iterate over additional models as needed.
- Identify the Final Model

<b>6.Documentation </b>

- Review the Results
- Present and share your findings - storytelling
- Finalize Code
- Finalize Documentation

## 1.Problem Identification

#### Goal: predict whether a news article is fake or real

## 2.Data Wrangling

### 2.1 Import required libraries

In [1]:
#Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import nltk
from nltk.stem import WordNetLemmatizer
import re
from wordcloud import WordCloud, STOPWORDS
from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

### 2.2 Data Collection
- Locating the data
- Data loading
- Data joining

In [2]:
#locating the dataset
import os
print(os.name)
print(os.getcwd())
print(os.listdir())

nt
C:\Users\Sanjay\3datascienceprojects\datascienceprojects\NLP\Fake News\fake_news_detection\notebooks
['.gitkeep', '.ipynb_checkpoints', 'concepts', 'Fake News detection Project.ipynb', 'fakenewsdetection.ipynb', 'fake_news_train_data_eda.html', 'pipeline.sav', 'Project on Fake News .ipynb']


### The cookiecutter template is used here to organize the project's directories and files.

In [3]:
#loading  the datasets
data_set_path =("C:/Users/Sanjay/3datascienceprojects/datascienceprojects/NLP/Fake News/fake_news_detection/data/raw/")

fake_news_train_data = pd.read_csv(data_set_path+"train.csv")
fake_news_test_data =pd.read_csv(data_set_path+'test.csv')
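
The absolute Windows path above only works on the author's machine. A minimal sketch of a more portable variant, assuming the cookiecutter layout implied by the paths above (a `notebooks/` folder next to `data/raw/`). The helper name `load_split` and the tiny demo file are hypothetical:

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_split(raw_dir, name):
    """Read `<raw_dir>/<name>.csv` into a DataFrame (hypothetical helper)."""
    return pd.read_csv(Path(raw_dir) / f"{name}.csv")


# In the notebook the raw folder could be located relative to notebooks/, e.g.
#   raw_dir = Path.cwd().parent / "data" / "raw"   # assumed project layout
# Here a throwaway folder with a two-row file stands in for it.
with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame({"title": ["a", "b"], "class": ["Fake", "Real"]})
    demo.to_csv(Path(tmp) / "train.csv", index=False)
    loaded = load_split(tmp, "train")

print(loaded.shape)
```

Building paths with `pathlib` avoids string concatenation and works with both Windows and POSIX separators.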

In [4]:
fake_news_train_data.info()   #train dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   index       40000 non-null  int64 
 1   title       40000 non-null  object
 2   text        40000 non-null  object
 3   subject     40000 non-null  object
 4   date        40000 non-null  object
 5   class       40000 non-null  object
 6   Unnamed: 6  1 non-null      object
dtypes: int64(1), object(6)
memory usage: 2.1+ MB
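
The `info()` output shows `Unnamed: 6` with a single non-null value out of 40000 rows, which usually points to a stray delimiter in one row rather than a real feature. A minimal sketch of dropping such nearly empty columns, shown on a toy frame. The helper name `drop_sparse_columns` and the 50% threshold are assumptions, not part of the original notebook:

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, min_fraction=0.5):
    """Keep only columns whose share of non-null values is at least min_fraction."""
    non_null_share = df.notna().mean()
    return df.loc[:, non_null_share >= min_fraction]


toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "class": ["Fake", "Real", "Fake", "Real"],
    "Unnamed: 6": [np.nan, np.nan, "x", np.nan],  # mimics the nearly empty column
})
cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```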


In [None]:
fake_news_test_data.info() ### test data

In [None]:
print("Train dataset (labelled as fake or real):", fake_news_train_data.shape)  # train dataset
print("Test dataset (to be predicted as fake or real):", fake_news_test_data.shape)  # test dataset

#### The train dataset has 40k rows and 7 columns, while the test dataset has 4k rows and 5 columns

In [None]:
fake_news_test_data.columns

In [None]:
fake_news_train_data.head()

In [None]:
fake_news_train_data.head(5).T

In [None]:
fake_news_train_data.describe()

In [None]:
fake_news_train_data.isna().sum()

In [None]:
fake_news_train_data.title[0:5]

In [None]:
fake_news_train_data.text[0] #first text contents

In [None]:
fake_news_train_data.subject.nunique()

In [None]:
fake_news_train_data.subject.unique()
#print(pd.unique(fake_news_train_data['subject']))
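
The data definition step also asks for counts or percentages per unique value. A short sketch with `value_counts`, run here on a toy frame. In the notebook the same calls would apply to `fake_news_train_data['subject']` and `fake_news_train_data['class']`; the toy values are made up:

```python
import pandas as pd

toy = pd.DataFrame({
    "subject": ["politics", "world", "politics", None],
    "class": ["Fake", "Real", "Fake", "Real"],
})

# absolute counts per subject, keeping missing values as their own bucket
subject_counts = toy["subject"].value_counts(dropna=False)

# share of each class, useful for spotting class imbalance
class_share = toy["class"].value_counts(normalize=True)

print(subject_counts)
print(class_share)
```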

#### Explore fake news test data 

In [None]:
print(fake_news_train_data.isnull().sum())
print('************')
print(fake_news_test_data.isnull().sum())

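#### Apart from `Unnamed: 6` in the train set, which holds a single non-null value according to the `info()` output above, no null values are reported in either dataset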

In [None]:
test =fake_news_test_data.copy()
train = fake_news_train_data.copy()

test['total']=fake_news_test_data['title']+' '+fake_news_test_data['text']
train['total']=fake_news_train_data['title']+' '+fake_news_train_data['text']
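
String concatenation with `+` propagates missing values: if either `title` or `text` is NaN, `total` becomes NaN as well. A minimal sketch of a guard, on a toy frame, assuming empty strings are an acceptable stand-in for missing fields:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["Breaking story", np.nan],
    "text": ["Details follow.", "Body without a title."],
})

# plain concatenation: the second row turns into NaN
naive_total = toy["title"] + " " + toy["text"]

# filling missing fields first keeps whatever text is available
safe_total = (toy["title"].fillna("") + " " + toy["text"].fillna("")).str.strip()

print(naive_total.isna().tolist())
print(safe_total.tolist())
```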

In [None]:
test[['total']]

In [None]:
test_top5 = test.head(1000)  # first 1000 test rows (despite the "top5" name)

In [None]:
test_top5

In [None]:
train_top5 = train.head(2000).copy()  # first 2000 train rows; copy so later edits do not touch `train`

In [None]:
type(train)

In [None]:
train_top5[['total']]

## 3.Exploratory Data Analysis

<b>pandas profiling</b> helps in visualizing and understanding the distribution of each variable.
It generates a report that puts this information in one place.

The main disadvantage of pandas profiling is its cost on large datasets: the time needed to generate the report increases considerably as the data grows.

In [None]:
#from pandas_profiling import ProfileReport
#prof = ProfileReport(fake_news_train_data)
#prof.to_file(output_file='fake_news_train_data_eda.html')

In [None]:
#prof ## visualize the dataset
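
One common way around the report-generation cost is to profile a random sample instead of the full frame. A minimal sketch, assuming a sample of a few thousand rows is representative enough for a first look. The sample size and the synthetic frame are illustrative only:

```python
import numpy as np
import pandas as pd

# synthetic stand-in for a large frame such as fake_news_train_data
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "index": np.arange(40_000),
    "class": rng.choice(["Fake", "Real"], size=40_000),
})

# a fixed random_state makes the sample, and hence the report, reproducible
sample = big.sample(n=5_000, random_state=0)

# the sample can then be passed to the profiler in place of the full frame
print(sample.shape)
print(sample["class"].value_counts(normalize=True).round(2))
```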

### Creating word clouds

In [None]:
from wordcloud import WordCloud, STOPWORDS

In [None]:
stopwords = set(STOPWORDS)

def lowercase_and_join(texts):
    # lowercase every whitespace-separated token and join all texts into one string
    return " ".join(" ".join(text.lower().split()) for text in texts)

# combined title + text of every row in each class
real_words = lowercase_and_join(train[train['class'] == 'Real'].total)
fake_words = lowercase_and_join(train[train['class'] == 'Fake'].total)

In [None]:
# word cloud for real news
wordcloud = WordCloud(width = 800, height = 800, 
                background_color ='white', 
                stopwords = stopwords, 
                min_font_size = 10).generate(real_words) 
  
# plot the WordCloud image                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
  
plt.show() 
 

In [None]:
#fake words
wordcloud = WordCloud(width = 800, height = 800, 
                background_color ='white', 
                stopwords = stopwords, 
                min_font_size = 10).generate(fake_words) 
  
# plot the WordCloud image                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
  
plt.show() 
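
Word clouds are hard to compare by eye, so a plain frequency table can complement them. A minimal sketch using `collections.Counter`. The helper `top_words`, the sample string, and the tiny stop set are hypothetical stand-ins for `real_words`, `fake_words`, and `STOPWORDS`:

```python
from collections import Counter


def top_words(text, stop_words, n=3):
    """Return the n most frequent lowercase tokens that are not stopwords."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return Counter(tokens).most_common(n)


demo_stop_words = {"the", "a", "on", "of"}
demo_text = "the senate voted on the bill the senate passed a bill on trade"

print(top_words(demo_text, demo_stop_words, n=2))
```

Running the same helper on the real and fake corpora gives two ranked lists that can be compared side by side.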
 

## 4.Pre-processing and Training Data Development

- Create dummy or indicator features for categorical variables
- Standardize the magnitude of numeric features
- Split into testing and training datasets
- Apply scaler to the testing set


### Cleaning and preprocessing

#### Using regex, tokenization, stopword removal, and lemmatization

In [None]:
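#tokenization
nltk.download('punkt')
nltk.download('stopwords')  # corpus read by stopwords.words('english') below
nltk.download('wordnet')    # corpus used by WordNetLemmatizer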

In [None]:
#stopword
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

In [None]:
lemmatizer = WordNetLemmatizer()
stop_word_set = set(stop_words)  # set membership checks are faster than list lookups

def clean_text(sentence):
    sentence = re.sub(r'[^\w\s]', '', sentence)    # cleaning: strip punctuation
    words = nltk.word_tokenize(sentence.lower())   # lowercase first, then tokenize
    # the NLTK stopword list is lowercase, so filter after lowercasing, then lemmatize
    return ' '.join(lemmatizer.lemmatize(w) for w in words if w not in stop_word_set)

train_top5 = train_top5.copy()  # avoid modifying a slice of `train`
train_top5['total'] = train_top5['total'].apply(clean_text)
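
To see what each step of the cleaning cell does, it helps to run the pipeline on a single sentence. The sketch below is a dependency-free approximation: it uses whitespace tokenization and a tiny hand-written stop set instead of NLTK's tokenizer and stopword list, and it leaves out lemmatization because that needs the WordNet corpus. The names `clean_sentence` and `DEMO_STOP_WORDS` are hypothetical:

```python
import re

DEMO_STOP_WORDS = {"the", "is", "a", "to", "of"}  # tiny stand-in for NLTK's list


def clean_sentence(sentence, stop_words=DEMO_STOP_WORDS):
    """Strip punctuation, lowercase, tokenize on whitespace, drop stopwords."""
    no_punct = re.sub(r"[^\w\s]", "", sentence)   # same regex as the cleaning cell
    tokens = no_punct.lower().split()             # whitespace tokenization
    kept = [t for t in tokens if t not in stop_words]
    return " ".join(kept)


print(clean_sentence("The Senate is set to vote, again!"))
```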


In [None]:
train_top5 = train_top5[['total','class']]

In [None]:
train_top5.total[1]  # second row (index labels start at zero)

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
X_train = train_top5['total']
Y_train = train_top5['class']

### Bag-of-words / CountVectorizer

In [None]:
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

In [None]:
print(X.toarray())

### TF-IDF Vectorizer

In [None]:
def vectorize_text(features, max_features):
    vectorizer = TfidfVectorizer( stop_words='english',
                            decode_error='strict',
                            analyzer='word',
                            ngram_range=(1, 2),
                            max_features=max_features                  
                            )
    feature_vec = vectorizer.fit_transform(features)
    return feature_vec.toarray()

In [None]:
tfidf_features = vectorize_text(['hello how are you doing','hi i am doing fine'],30)

In [None]:
tfidf_features
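
For reference, with scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`) each weight is the raw term count multiplied by idf(t) = ln((1 + n) / (1 + df(t))) + 1, and each row is then scaled to unit Euclidean length. A short sketch that reproduces this by hand on the two sentences above and checks it against `TfidfVectorizer`. Note that it uses default unigrams without stop-word filtering, unlike `vectorize_text`:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["hello how are you doing", "hi i am doing fine"]

# raw term counts, one row per document
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed inverse document frequency

weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # l2-normalize rows

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

Terms that appear in both documents, such as "doing", get the lowest idf and therefore the smallest weight within each row.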

### Applying CountVectorizer and TF-IDF

In [None]:
# Feature extraction using count vectorization and TF-IDF.
count_vectorizer = CountVectorizer()
freq_term_matrix = count_vectorizer.fit_transform(X_train)  # learn vocabulary and count terms
tfidf = TfidfTransformer(norm="l2")
tf_idf_matrix = tfidf.fit_transform(freq_term_matrix)       # learn IDF weights and apply them

In [None]:
tf_idf_matrix

In [None]:
# note: test_top5['total'] has not gone through the cleaning step applied to train_top5
test_counts = count_vectorizer.transform(test_top5['total'].values)
test_tfidf = tfidf.transform(test_counts)

# split the labelled data into training and hold-out samples
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tf_idf_matrix, Y_train, random_state=0)
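
In the cells above the vectorizer is fitted on all labelled rows before the split, so vocabulary and IDF statistics from the hold-out rows leak into training. A minimal sketch of one way to avoid that: split the raw text first and wrap vectorizer and classifier in a scikit-learn `Pipeline`, which fits the vectorizer on the training part only. The toy texts and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

texts = [
    "senate passes budget bill", "court upholds election ruling",
    "minister announces trade deal", "parliament debates tax reform",
    "aliens endorse candidate shock", "miracle cure hidden by doctors",
    "celebrity secretly a lizard", "moon landing staged says insider",
]
labels = ["Real"] * 4 + ["Fake"] * 4

# split the raw text, not the vectorized matrix
text_train, text_test, label_train, label_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

model = Pipeline([
    ("tfidf", TfidfVectorizer()),      # fitted on text_train only
    ("clf", LogisticRegression()),
])
model.fit(text_train, label_train)

predictions = model.predict(text_test)  # hold-out text is only transformed
print(len(predictions))
```

A fitted pipeline also keeps the vectorizer and the classifier together, so new text can be scored with a single `predict` call.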

## 5.Modeling

- Fit Models with Training Data Set
- Review Model Outcomes — Iterate over additional models as needed.
- Identify the Final Model

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

In [None]:
logreg = LogisticRegression(C=1e5)  # large C means weak regularization
logreg.fit(X_train, y_train)
pred = logreg.predict(X_test)
print('Accuracy of Logistic classifier on training set: {:.2f}'
     .format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic classifier on test set: {:.2f}'
     .format(logreg.score(X_test, y_test)))
cm = confusion_matrix(y_test, pred)  # rows: true class, columns: predicted class
cm

### MultinomialNB

In [None]:
from sklearn.naive_bayes import MultinomialNB

NB = MultinomialNB()
NB.fit(X_train, y_train)
pred = NB.predict(X_test)
print('Accuracy of NB classifier on training set: {:.2f}'
     .format(NB.score(X_train, y_train)))
print('Accuracy of NB classifier on test set: {:.2f}'
     .format(NB.score(X_test, y_test)))
cm = confusion_matrix(y_test, pred)
cm
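
The raw arrays printed by `confusion_matrix` do not say which row belongs to which class. A minimal sketch that fixes the label order explicitly and wraps the result in a labelled DataFrame. The toy labels are made up, and the class names `'Fake'` and `'Real'` are taken from the filters used in the word-cloud cell:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

y_true = ["Fake", "Real", "Fake", "Real", "Fake"]
y_pred = ["Fake", "Real", "Real", "Real", "Fake"]
class_order = ["Fake", "Real"]

# rows are true classes, columns are predicted classes, both in class_order
cm = confusion_matrix(y_true, y_pred, labels=class_order)

cm_table = pd.DataFrame(
    cm,
    index=[f"true {c}" for c in class_order],
    columns=[f"predicted {c}" for c in class_order],
)
print(cm_table)
```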

In [None]:
X_train.shape


In [None]:
X_test.shape

In [None]:
y_train.shape

In [None]:
y_test.shape

### Prediction

In [None]:

result1 = logreg.predict(X_test)
result1

In [None]:
result2 = NB.predict(X_test)
result2

In [None]:
result1 == result2
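
The element-wise comparison returns one boolean per hold-out row, which is hard to read across hundreds of rows. A small sketch that condenses it into an agreement rate and the positions where the two models differ. The arrays are made-up stand-ins for `result1` and `result2`:

```python
import numpy as np

demo_result1 = np.array(["Fake", "Real", "Real", "Fake"])
demo_result2 = np.array(["Fake", "Real", "Fake", "Fake"])

agree = demo_result1 == demo_result2        # boolean array, one entry per row
agreement_rate = agree.mean()               # fraction of identical predictions
disagreement_idx = np.flatnonzero(~agree)   # positions where the models differ

print(agreement_rate, disagreement_idx)
```

The rows listed in `disagreement_idx` are natural candidates for manual inspection when choosing between the two models.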


## 6.Documentation 

- Review the Results
- Present and share your findings - storytelling
- Finalize Code
- Finalize Documentation