# Fake and Real News

## Overview

Can this data set be used to build an algorithm that determines whether an article is fake news?

## Data Description
The Fake.csv file contains a list of articles considered "fake" news. True.csv contains a list of articles considered "real" news. Both files contain:

* The title of the article
* The text of the article
* The subject of the article
* The date the article was posted

## Files

* Fake.csv
* True.csv

## So let’s begin here…

In [None]:
import numpy as np
import pandas as pd

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from string import punctuation

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Load Data

In [None]:
real = pd.read_csv("../input/fake-and-real-news-dataset/True.csv")
fake = pd.read_csv("../input/fake-and-real-news-dataset/Fake.csv")

In [None]:
real.head()

In [None]:
fake.head()

We add a new column, category, to both the real and fake dataframes: 1 for real news and 0 for fake news.

In [None]:
real['category']=1
fake['category']=0

We concatenate both dataframes into a single dataframe, which we will use to train and evaluate the model.

In [None]:
df = pd.concat([real,fake])
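
Note that `pd.concat` keeps each frame's original row labels, so the combined frame has repeated index values. A minimal sketch on two hypothetical toy frames (`real_toy` and `fake_toy` are stand-ins, not the notebook's data) shows the effect and how `ignore_index=True` gives a fresh 0..n-1 index:

```python
import pandas as pd

# Hypothetical toy frames standing in for the real and fake dataframes.
real_toy = pd.DataFrame({"title": ["a", "b"], "category": 1})
fake_toy = pd.DataFrame({"title": ["c", "d"], "category": 0})

# Default behaviour: the row labels 0 and 1 each appear twice.
stacked = pd.concat([real_toy, fake_toy])

# ignore_index=True renumbers the rows 0..n-1.
renumbered = pd.concat([real_toy, fake_toy], ignore_index=True)

print(list(stacked.index))     # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```

The repeated labels do not appear to matter for the steps in this notebook, since nothing here selects rows by label.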

In [None]:
df.isna().sum()

In [None]:
df['title'].count()

In [None]:
df.subject.value_counts()

We now concatenate the text, title and subject columns into the text column and drop the columns we no longer need.

In [None]:
df['text'] = df['text'] + " " + df['title'] + " " + df['subject']
del df['title']
del df['subject']
del df['date']

In [None]:
stop = set(stopwords.words('english'))
pnc = list(punctuation)
stop.update(pnc)

In [None]:
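stemmer = PorterStemmer()
def stem_text(text):
    """Drop stop words and punctuation tokens, then Porter-stem the rest."""
    final_text = []
    for i in text.split():
        # Compare the lower-cased token against the stop list.
        if i.strip().lower() not in stop:
            word = stemmer.stem(i.strip())
            final_text.append(word)
    return " ".join(final_text)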

In [None]:
df['text'] = df['text'].apply(stem_text)
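
To see what this preprocessing does to a sentence, here is a small sketch of the same filter-then-stem loop. It assumes a tiny hand-written stop word set (`toy_stop`) in place of NLTK's English list, so it runs without downloading the stopwords corpus; `clean` is a hypothetical name used only for this illustration.

```python
from string import punctuation
from nltk.stem.porter import PorterStemmer

# Assumption: a tiny stand-in for stopwords.words('english') plus punctuation.
toy_stop = {"the", "is", "a", "are"} | set(punctuation)

stemmer = PorterStemmer()

def clean(text):
    """Drop stop words and bare punctuation tokens, then stem what is left."""
    kept = []
    for token in text.split():
        if token.strip().lower() not in toy_stop:
            kept.append(stemmer.stem(token.strip()))
    return " ".join(kept)

out = clean("The senators are voting - quickly")
print(out)
```

Because the text is split on whitespace, only standalone punctuation tokens are dropped; punctuation attached to a word (for example `news,`) stays part of that token.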

Split the dataset into a train set and a test set.

In [None]:
X_train,X_test,y_train,y_test = train_test_split(df['text'],df['category'])
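
`train_test_split` shuffles the rows and, by default, holds out 25% of them for testing; with no seed passed, the split changes from run to run. The sketch below, on hypothetical toy data, shows how `random_state` makes the split repeatable and `stratify` keeps the class ratio the same in both parts. The values `42` and `0.25` are illustrative choices, not settings taken from this notebook.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 "real" (1) and 8 "fake" (0) documents.
texts = [f"doc {i}" for i in range(16)]
labels = [1] * 8 + [0] * 8

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.25,    # same share as the default
    random_state=42,   # fixed seed, so reruns give the same split
    stratify=labels,   # preserve the 50/50 class balance
)

print(len(X_tr), len(X_te))              # 12 4
print(sum(y_te), len(y_te) - sum(y_te))  # 2 2
```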

In [None]:
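# max_df must be the float 1.0 ("up to 100% of documents"); the integer 1 would
# drop every term that occurs in more than one document. min_df=1 keeps all terms.
cv = CountVectorizer(min_df=1,max_df=1.0,ngram_range=(1,2))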

cv_train = cv.fit_transform(X_train)
cv_test = cv.transform(X_test)

print('Train shape: ',cv_train.shape)
print('Test shape: ',cv_test.shape)
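
The `max_df` argument of `CountVectorizer` is interpreted differently depending on its type: a float is a share of documents, while an integer is an absolute document count. A small sketch on a hypothetical three-document corpus illustrates why `max_df=1` and `max_df=1.0` build very different vocabularies:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; "news" occurs in every document.
docs = ["fake news spreads", "real news matters", "news travels fast"]

# Integer 1: keep only terms found in at most ONE document.
vocab_int = CountVectorizer(max_df=1).fit(docs).vocabulary_

# Float 1.0: keep terms found in up to 100% of the documents.
vocab_float = CountVectorizer(max_df=1.0).fit(docs).vocabulary_

print("news" in vocab_int)    # False
print("news" in vocab_float)  # True
```

With the integer form, any word or bigram shared by two or more articles is discarded, so the classifier would only see terms unique to a single training document.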

## Define Model

In [None]:
nb = MultinomialNB()

## Fit Model

In [None]:
nb.fit(cv_train, y_train)

## Predict

In [None]:
pred_nb = nb.predict(cv_test)
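
The define, fit and predict steps can be tried end to end on a handful of sentences. This is only a sketch on hypothetical toy documents (`train_docs`, `train_labels` and `new_doc` are made up for illustration); it also shows `predict_proba`, which returns one probability per class in the order given by `model.classes_`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set: 1 = real, 0 = fake.
train_docs = ["senate passes budget bill", "court upholds ruling",
              "shocking secret exposed", "you will not believe this"]
train_labels = [1, 1, 0, 0]

# Turn the documents into word counts and fit the classifier on them.
vec = CountVectorizer()
model = MultinomialNB().fit(vec.fit_transform(train_docs), train_labels)

# New text must go through the SAME fitted vectorizer (transform, not fit).
new_doc = vec.transform(["senate passes ruling"])
pred = model.predict(new_doc)
proba = model.predict_proba(new_doc)

print(pred)   # [1]
print(proba)  # one row, columns ordered as model.classes_ == [0, 1]
```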

## Accuracy

In [None]:
score = metrics.accuracy_score(y_test, pred_nb)
print("Accuracy Score: ",score)
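
Accuracy alone does not show which class the errors fall in. As a hedged sketch, with hypothetical label lists standing in for `y_test` and `pred_nb`, a confusion matrix and a per-class report can be obtained from the same `metrics` module:

```python
from sklearn import metrics

# Hypothetical labels: 1 = real, 0 = fake.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

# Rows are true classes, columns are predicted classes, ordered [0, 1].
cm = metrics.confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# Share of correct predictions: 4 of 6 here.
acc = metrics.accuracy_score(y_true, y_pred)
print(acc)

# Precision, recall and F1 for each class.
print(metrics.classification_report(y_true, y_pred, target_names=["fake", "real"]))
```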