# FAKE NEWS DETECTION USING DL


## To predict authenticity of a piece of news


* Defining the problem
* Data Preparation
* Exploratory Data Analysis
* Feature Engineering
* Feature Selection
* Modelling
* Training

## 1. Defining the Problem


`..........

## 2. Data Preparation


The "Liar, Liar Pants on Fire" dataset is used. It is a benchmark dataset used by researchers and in fake news challenge contests. You can get the dataset [here](https://www.cs.ucsb.edu/william/data/liar_dataset.zip).


### Load the train, test, and validation datasets using pandas

In [1]:
import pandas as pd

# The dataset splits are tab-separated files without a header row
train = pd.read_csv('train.tsv', delimiter='\t', header=None)
test = pd.read_csv('test.tsv', delimiter='\t', header=None)
valid = pd.read_csv('valid.tsv', delimiter='\t', header=None)

In [2]:
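# Drop the statement ID (column 0) and the speaker's count columns (8-12),
# then give the remaining eight columns readable names
train = train.drop([0, 8, 9, 10, 11, 12], axis=1)
train.columns = ["label", "statement", "subject", "speaker", "job", "state", "party", "venue"]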
train.head()

| | label | statement | subject | speaker | job | state | party | venue |
|---|---|---|---|---|---|---|---|---|
| 0 | false | Says the Annies List political group supports ... | abortion | dwayne-bohac | State representative | Texas | republican | a mailer |
| 1 | half-true | When did the decline of coal start? It started... | energy,history,job-accomplishments | scott-surovell | State delegate | Virginia | democrat | a floor speech. |
| 2 | mostly-true | Hillary Clinton agrees with John McCain "by vo... | foreign-policy | barack-obama | President | Illinois | democrat | Denver |
| 3 | false | Health care reform legislation is likely to ma... | health-care | blog-posting | | | none | a news release |
| 4 | half-true | The economic turnaround started at the end of ... | economy,jobs | charlie-crist | | Florida | democrat | an interview on CNN |


In [3]:
test = test.drop([0, 8, 9, 10, 11, 12], axis=1)
test.columns=["label", "statement", "subject", "speaker", "job", "state", "party", "venue"]
test.head()

| | label | statement | subject | speaker | job | state | party | venue |
|---|---|---|---|---|---|---|---|---|
| 0 | true | Building a wall on the U.S.-Mexico border will... | immigration | rick-perry | Governor | Texas | republican | Radio interview |
| 1 | false | Wisconsin is on pace to double the number of l... | jobs | katrina-shankland | State representative | Wisconsin | democrat | a news conference |
| 2 | false | Says John McCain has done nothing to help the ... | military,veterans,voting-record | donald-trump | President-Elect | New York | republican | comments on ABC's This Week. |
| 3 | half-true | Suzanne Bonamici supports a plan that will cut... | medicare,message-machine-2012,campaign-adverti... | rob-cornilles | consultant | Oregon | republican | a radio show |
| 4 | pants-fire | When asked by a reporter whether hes at the ce... | campaign-finance,legal-issues,campaign-adverti... | state-democratic-party-wisconsin | | Wisconsin | democrat | a web video |


In [4]:
valid = valid.drop([0, 8, 9, 10, 11, 12], axis=1)
valid.columns=["label", "statement", "subject", "speaker", "job", "state", "party", "venue"]
valid.head()

| | label | statement | subject | speaker | job | state | party | venue |
|---|---|---|---|---|---|---|---|---|
| 0 | barely-true | We have less Americans working now than in the... | economy,jobs | vicky-hartzler | U.S. Representative | Missouri | republican | an interview with ABC17 News |
| 1 | pants-fire | When Obama was sworn into office, he DID NOT u... | obama-birth-certificate,religion | chain-email | | | none | |
| 2 | false | Says Having organizations parading as being so... | campaign-finance,congress,taxes | earl-blumenauer | U.S. representative | Oregon | democrat | a U.S. Ways and Means hearing |
| 3 | half-true | Says nearly half of Oregons children are poor. | poverty | jim-francesconi | Member of the State Board of Higher Education | Oregon | none | an opinion article |
| 4 | half-true | On attacks by Republicans that various program... | economy,stimulus | barack-obama | President | Illinois | democrat | interview with CBS News |
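
The three cells above apply the same drop-and-rename steps to each split. A minimal sketch of folding them into one helper could look like this. `load_split`, `COLUMNS`, and `DROPPED` are hypothetical names that do not appear in the notebook, and the two sample rows are made up so the snippet runs without the dataset files.

```python
import io

import pandas as pd

# Names given to the eight columns that are kept (same as in the cells above)
COLUMNS = ["label", "statement", "subject", "speaker", "job", "state", "party", "venue"]
# Positional columns removed in the cells above
DROPPED = [0, 8, 9, 10, 11, 12]


def load_split(source):
    """Read one headerless TSV split, drop unused columns, and name the rest."""
    frame = pd.read_csv(source, delimiter="\t", header=None)
    frame = frame.drop(DROPPED, axis=1)
    frame.columns = COLUMNS
    return frame


# Made-up rows with the same 14-field layout, used only for demonstration
rows = [
    ["1.json", "half-true", "An example statement", "economy", "some-speaker",
     "Governor", "Texas", "democrat", "0", "1", "0", "0", "0", "a mailer"],
    ["2.json", "pants-fire", "Another example statement", "jobs", "other-speaker",
     "Senator", "Ohio", "republican", "1", "0", "0", "0", "0", "an interview"],
]
sample_tsv = "\n".join("\t".join(row) for row in rows)

demo = load_split(io.StringIO(sample_tsv))
print(demo)
```

With the real files this would be called as `load_split('train.tsv')`, and likewise for the test and validation splits.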


In [5]:
train.shape

(10240, 8)

In [6]:
test.shape

(1267, 8)

In [7]:
valid.shape

(1284, 8)

In [8]:
train.isnull().sum()

label           0
statement       0
subject         2
speaker         2
job          2897
state        2208
party           2
venue         102
dtype: int64

In [9]:
test.isnull().sum()

label          0
statement      0
subject        0
speaker        0
job          325
state        262
party          0
venue         17
dtype: int64

In [10]:
valid.isnull().sum()

label          0
statement      0
subject        0
speaker        0
job          345
state        279
party          0
venue         12
dtype: int64

In [11]:
train_clean = train.dropna(axis='index')


In [12]:
train_clean.head()


| | label | statement | subject | speaker | job | state | party | venue |
|---|---|---|---|---|---|---|---|---|
| 0 | false | Says the Annies List political group supports ... | abortion | dwayne-bohac | State representative | Texas | republican | a mailer |
| 1 | half-true | When did the decline of coal start? It started... | energy,history,job-accomplishments | scott-surovell | State delegate | Virginia | democrat | a floor speech. |
| 2 | mostly-true | Hillary Clinton agrees with John McCain "by vo... | foreign-policy | barack-obama | President | Illinois | democrat | Denver |
| 5 | true | The Chicago Bears have had more starting quart... | education | robin-vos | Wisconsin Assembly speaker | Wisconsin | republican | a an online opinion-piece |
| 7 | half-true | I'm the only person on this stage who has work... | ethics | barack-obama | President | Illinois | democrat | a Democratic debate in Philadelphia, Pa. |


In [15]:
train_clean.shape

(6724, 8)

In [16]:
test_clean = test.dropna(axis='index')

In [17]:
test_clean.shape

(853, 8)

In [18]:
valid_clean = valid.dropna(axis='index')

In [19]:
valid_clean.shape

(861, 8)
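
To make the cost of `dropna` concrete, the shapes printed above can be turned into retention rates. This is plain arithmetic on the row counts reported by the notebook; the dictionary names are illustrative only.

```python
# Row counts before and after dropna, copied from the shape outputs above
rows_before = {"train": 10240, "test": 1267, "valid": 1284}
rows_after = {"train": 6724, "test": 853, "valid": 861}

# Fraction of rows that survive dropping every row with a missing value
kept_fraction = {
    split: rows_after[split] / rows_before[split] for split in rows_before
}

for split, fraction in kept_fraction.items():
    print(f"{split}: kept {rows_after[split]} of {rows_before[split]} rows ({fraction:.1%})")
```

Roughly a third of each split is discarded, mostly because of the missing `job` and `state` values counted earlier, so dropping only rows with a missing `statement` or `label` may be worth considering if those two columns are not used as features.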

## Text Processing

In [20]:
import csv
import numpy as np
import nltk
from nltk.stem import SnowballStemmer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
import seaborn as sb


### 1. Removing Stop Words

Stop words are very common words such as ‘if’, ‘but’, ‘we’, ‘he’, ‘she’, and ‘they’. We can usually remove them without changing the semantics of a text, and doing so often (but not always) improves the performance of a model.

In [14]:
#Before removing stopwords

train_clean.iloc[:,1].head()

0    Says the Annies List political group supports ...
1    When did the decline of coal start? It started...
2    Hillary Clinton agrees with John McCain "by vo...
5    The Chicago Bears have had more starting quart...
7    I'm the only person on this stage who has work...
Name: statement, dtype: object

In [24]:
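# NLTK's full English list is available via stopwords.words('english');
# it is imported here but a small hand-picked list is used instead
from nltk.corpus import stopwords

# Matching is case-sensitive, so capitalized forms such as "The" are kept
english_stop_words = ['at', 'of', 'is', 'in', 'on', 'a', 'the', 'are']

def remove_stop_words(corpus):
    """Return each statement with the listed stop words removed."""
    removed_stop_words = []
    for statement in corpus:
        removed_stop_words.append(
            ' '.join([word for word in statement.split()
                      if word not in english_stop_words])
        )
    return removed_stop_words

train_clean.loc[:,'statement'] = remove_stop_words(train_clean.loc[:,'statement'])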

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item_labels[indexer[info_axis]]] = value
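
The message above is pandas' `SettingWithCopyWarning`: `train_clean` was derived from `train` via `dropna`, so pandas cannot tell whether the assignment is meant to reach the original frame. One common way to avoid it, on pandas versions that emit this warning, is to take an explicit copy right after dropping rows. The sketch below uses a made-up miniature frame rather than the real data.

```python
import pandas as pd

# Made-up stand-in for the training split, with one missing job title
raw = pd.DataFrame({
    "label": ["false", "true", "half-true"],
    "statement": ["the cat is on the mat", "a dog in the fog", "birds are loud"],
    "job": ["Governor", None, "Senator"],
})

# .copy() makes the cleaned frame an independent object, so later
# assignments are unambiguous and leave `raw` untouched
clean = raw.dropna(axis="index").copy()

# Overwrite the statement column of the cleaned frame
clean.loc[:, "statement"] = clean["statement"].str.upper()

print(clean)
```

In the notebook this would correspond to writing `train.dropna(axis='index').copy()` when `train_clean` is created.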


In [27]:
#After removing stopwords

train_clean.loc[:,'statement'].head()

0    Says Annies List political group supports thir...
1    When did decline coal start? It started when n...
2    Hillary Clinton agrees with John McCain "by vo...
5    The Chicago Bears have had more starting quart...
7    I'm only person this stage who has worked acti...
Name: statement, dtype: object
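
Note that "The" survives in the fourth statement above because the comparison is case-sensitive, and a token such as `the,` would also slip through because punctuation stays attached to it. A hedged sketch of a variant that lowercases each token and strips surrounding punctuation before the lookup is shown below. `remove_stop_words_normalized` is a hypothetical name, and whether this change helps the model has not been tested here.

```python
import string

# Same eight words as the notebook's list, stored as a set for fast lookup
STOP_WORDS = {"at", "of", "is", "in", "on", "a", "the", "are"}


def remove_stop_words_normalized(statements):
    """Drop stop words, comparing tokens case-insensitively and without
    leading or trailing punctuation. Kept tokens are returned unchanged."""
    cleaned = []
    for statement in statements:
        kept = [
            word
            for word in statement.split()
            if word.lower().strip(string.punctuation) not in STOP_WORDS
        ]
        cleaned.append(" ".join(kept))
    return cleaned


example = remove_stop_words_normalized(["The Chicago Bears are in the news."])
print(example)
```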

### 2. Normalization

A common next step in text preprocessing is to normalize the words in your corpus by trying to convert all of the different forms of a given word into one. Two methods that exist for this are Stemming and Lemmatization.

#### Stemming 

Stemming is considered the cruder, brute-force approach to normalization (although this doesn’t necessarily mean that it will perform worse). There are several algorithms, but in general they all use basic rules to chop off the ends of words.
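
To give an intuition for what it means to chop off the ends of words, here is a deliberately tiny suffix stripper. It is a toy for illustration only, not the Porter algorithm used in this notebook; the suffix list and the minimum stem length are arbitrary assumptions.

```python
# Suffixes are tried in this order; longer ones come first
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Lowercase a word and strip the first matching suffix,
    as long as at least three characters remain."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


for word in ["started", "supports", "voting", "has"]:
    print(word, "->", toy_stem(word))
```

Real stemmers apply many more rules in several passes, which is why their output (shown in the cells that follow) can differ from what this toy produces.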

In [28]:
from nltk.stem.porter import PorterStemmer

def get_stemmed_text(corpus):
    """Stem every whitespace-separated token of each statement with the Porter stemmer."""
    stemmer = PorterStemmer()
    return [' '.join([stemmer.stem(word) for word in statement.split()]) for statement in corpus]

# Column 1 is 'statement'
stemmed_news = get_stemmed_text(train_clean.iloc[:,1])

In [29]:
stemmed_news[:5]

['say anni list polit group support third-trimest abort demand.',
 'when did declin coal start? It start when natur ga took off that start to begin (presid georg w.) bush administration.',
 'hillari clinton agre with john mccain "bi vote to give georg bush benefit doubt iran."',
 'the chicago bear have had more start quarterback last 10 year than total number tenur (uw) faculti fire dure last two decades.',
 "i'm onli person thi stage who ha work activ just last year passing, along with russ feingold, some toughest ethic reform sinc watergate."]

#### Lemmatization

Lemmatization works by identifying the part-of-speech of a given word and then applying more complex rules to transform the word into its true root.

In [34]:
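from nltk.stem import WordNetLemmatizer

def get_lemmatized_text(corpus):
    """Lemmatize every whitespace-separated token of each statement."""
    lemmatizer = WordNetLemmatizer()
    # No pos argument is passed, so each token is lemmatized as a noun (NLTK's default)
    return [' '.join([lemmatizer.lemmatize(word) for word in statement.split()]) for statement in corpus]

lemmatized = get_lemmatized_text(train_clean.loc[:,'statement'])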

In [39]:
lemmatized[:5]

['Says Annies List political group support third-trimester abortion demand.',
 'When did decline coal start? It started when natural gas took off that started to begin (President George W.) Bushs administration.',
 'Hillary Clinton agrees with John McCain "by voting to give George Bush benefit doubt Iran."',
 'The Chicago Bears have had more starting quarterback last 10 year than total number tenured (UW) faculty fired during last two decades.',
 "I'm only person this stage who ha worked actively just last year passing, along with Russ Feingold, some toughest ethic reform since Watergate."]

In [None]:
# TODO (Sachin): lemmatization is not working as expected yet and needs fixing
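
One likely reason the lemmatized output above still contains forms such as "started" and "ha" is that `WordNetLemmatizer.lemmatize` treats a word as a noun unless a `pos` argument is given. A possible fix is to tag each token first and map the tag to a WordNet part of speech. The sketch below assumes NLTK's WordNet corpus and a POS tagger model have already been fetched with `nltk.download`; `to_wordnet_pos` and `lemmatize_with_pos` are hypothetical helpers, and this has not been run against the notebook's data.

```python
def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to the
    single-letter part of speech that WordNet expects."""
    if treebank_tag.startswith("J"):
        return "a"  # adjective
    if treebank_tag.startswith("V"):
        return "v"  # verb
    if treebank_tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback


def lemmatize_with_pos(statement):
    """Lemmatize each whitespace-separated token using its tagged part of speech."""
    # Imported here so the mapping above stays usable without NLTK data installed
    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(statement.split())
    return " ".join(
        lemmatizer.lemmatize(word, pos=to_wordnet_pos(tag)) for word, tag in tagged
    )


# The tag mapping can be checked without any downloaded data
print(to_wordnet_pos("VBD"), to_wordnet_pos("NNS"))
```

Applied to the corpus, this would replace the list comprehension in `get_lemmatized_text` with a call to `lemmatize_with_pos` per statement.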

In [40]:
# Use the stemmed statements instead of the lemmatized ones for now

train_clean.iloc[:,1] = stemmed_news

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item_labels[indexer[info_axis]]] = value


In [41]:
# Statements after normalization
train_clean.iloc[:,1].head()

0    say anni list polit group support third-trimes...
1    when did declin coal start? It start when natu...
2    hillari clinton agre with john mccain "bi vote...
5    the chicago bear have had more start quarterba...
7    i'm onli person thi stage who ha work activ ju...
Name: statement, dtype: object

### N-grams

In [None]:
# TODO: n-gram features are still to be implemented
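
This section has no code yet. As a starting point, a minimal sketch of word-level n-gram extraction is given below; `word_ngrams` is a hypothetical helper, and splitting on whitespace is an assumption carried over from the earlier cells rather than a decision made in the notebook.

```python
def word_ngrams(statement, n):
    """Return all runs of n consecutive whitespace-separated tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    tokens = statement.split()
    # A statement with fewer than n tokens yields an empty list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = word_ngrams("say anni list polit group", 2)
print(bigrams)
```

Each n-gram can then be counted per statement to build features that capture short word sequences in addition to single words.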