# Fake News Classification

![](https://world.edu/wp-content/uploads/2019/03/fake-fact.jpg)

<h2>Fake news is untrue information presented as news. It has no basis in fact but is presented as being factually accurate.</h2>

<h2>Misinformation is false or inaccurate information. Examples of misinformation include false rumors, insults, and pranks, while examples of more deliberate disinformation include malicious content such as hoaxes, spearphishing, and computational propaganda.</h2>

# There are four broad categories of fake news, according to media professor `Melissa Zimdars` of `Merrimack College`

<h3>CATEGORY 1: Fake, false, or regularly misleading websites that are shared on Facebook and social media. Some of these websites may rely on “outrage” by using distorted headlines and decontextualized or dubious information in order to generate likes, shares, and profits.</h3>

<h3>CATEGORY 2: Websites that may circulate misleading and/or potentially unreliable information</h3>

<h3>CATEGORY 3: Websites which sometimes use clickbait-y headlines and social media descriptions</h3>
    
<h3>CATEGORY 4: Satire/comedy sites, which can offer important critical commentary on politics and society, but have the potential to be shared as actual/literal news</h3>

**Our main focus will be on websites that intentionally write fake news articles, that is, categories 2, 3, and 4.**

# Characterization and Detection

<h2>Social media is a double-edged sword for news consumption. On the one hand, its low cost, easy access, and rapid dissemination of information let users consume and share news. On the other hand, it enables the viral spread of “fake news”, i.e., low-quality news with intentionally false information. The quick spread of fake news can have calamitous effects on individuals and society. For example, during the 2016 U.S. presidential election, the most popular fake news was spread more widely on Facebook than the most popular authentic mainstream news. Fake news detection on social media has therefore attracted increasing attention from researchers and politicians alike.</h2>

![](https://www.kdnuggets.com/images/fake-news-detection-611.jpg)

<h2>Detecting fake news through a news content approach</h2>

<h3>News content based approaches focus on extracting features from fake news content and fall into two groups: knowledge-based and style-based. Since fake news attempts to spread false claims, knowledge-based approaches use external sources to fact-check the claims in news content. In addition, fake news publishers often intend to spread distorted and misleading information, which requires particular writing styles, not seen in true news articles, that appeal to and persuade a wide range of consumers. Style-based approaches try to detect fake news by capturing these manipulative signals in the writing style.</h3>
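
To make the style-based idea concrete, here is a minimal sketch of a few surface-level style features one might extract from an article. The function name `style_features` and the particular features chosen are assumptions made for illustration; they are not part of this notebook's pipeline and are not a validated feature set.

```python
import string


def style_features(text):
    """Return a few simple, illustrative writing-style features (hypothetical helper)."""
    words = text.split()
    n_words = len(words)
    n_chars = len(text)
    n_punct = sum(1 for c in text if c in string.punctuation)
    return {
        "n_words": n_words,
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        # Words written entirely in capitals (single letters such as "I" are skipped)
        "all_caps_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        # Share of characters that are punctuation marks
        "punct_ratio": n_punct / n_chars if n_chars else 0.0,
        "avg_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
    }


example = style_features("SHOCKING!!! You won't BELIEVE what happened next?")
print(example)
```

Features like these could be appended as extra columns next to bag-of-words features, although whether they help on a given dataset has to be checked empirically.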

# Main components of any news article

> Title 

> Body/Text part

> Source

> Author

> Type of news/Tags (e.g., politics, religion)

> Date
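
As a rough illustration, the components listed above could be held in a small record type. The class name `NewsArticle` and its field names are hypothetical; which of these fields a particular dataset actually provides has to be checked against its columns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NewsArticle:
    """Illustrative container for the components of a news article (hypothetical)."""
    title: str
    text: str
    source: Optional[str] = None
    author: Optional[str] = None
    subject: Optional[str] = None  # type of news / tag, e.g. "politics"
    date: Optional[str] = None

    def available_fields(self):
        # Names of the components that are actually filled in for this article
        return [name for name, value in vars(self).items() if value is not None]


article = NewsArticle(title="Example headline", text="Example body.", subject="politics")
print(article.available_fields())
```

Making the optional components explicit shows which signals (for example source or author) are missing and therefore cannot be used as features.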

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
true = pd.read_csv("/kaggle/input/fake-and-real-news-dataset/True.csv")
fake = pd.read_csv("/kaggle/input/fake-and-real-news-dataset/Fake.csv")

In [None]:
print(true.shape,fake.shape)

In [None]:
true.head(20)

In [None]:
true.info()

In [None]:
true.subject.value_counts()

In [None]:
true.nunique()/true.shape[0]

In [None]:
true.isna().sum()/true.shape[0]

In [None]:
true.describe(include=['object'])

In [None]:
fake.head()

In [None]:
fake.describe(include=['object'])

In [None]:
fake.nunique()/fake.shape[0]*100

In [None]:
fake['title_len'] = fake.title.apply(len)
true['title_len'] = true.title.apply(len)

In [None]:
fake['text_len'] = fake.text.apply(len)
true['text_len'] = true.text.apply(len)

In [None]:
plt.figure(figsize=(12,8))
sns.kdeplot(true.text_len)
sns.kdeplot(fake.text_len)
plt.legend(['True','Fake'])
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.kdeplot(true.title_len)
sns.kdeplot(fake.title_len)
plt.legend(['True','Fake'])
plt.show()

In [None]:
import string
from nltk.corpus import stopwords
from collections import Counter

In [None]:
# Use a set so that membership tests during text cleaning are fast
stopword = set(stopwords.words('english'))

In [None]:
string.punctuation

In [None]:
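def wc_proccessed(x):
    # Lowercase the text and replace punctuation with spaces
    text = x.strip().lower()
    text_str = "".join([i if i not in string.punctuation else " " for i in text]).strip()
    # Keep purely alphabetic tokens and drop English stopwords
    text = [i for i in text_str.split() if i.isalpha()]
    text_str = [i for i in text if i not in stopword]
    # Drop single characters and tokens longer than 45 characters
    # (the longest word in an English dictionary has 45 characters)
    text = " ".join([i for i in text_str if len(i) > 1 and len(i) <= 45])
    return text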

In [None]:
fake_text_process = " ".join([i for i in fake.text.apply(wc_proccessed)])
true_text_process = " ".join([i for i in true.text.apply(wc_proccessed)])

In [None]:
word_count_fake_text = Counter(fake_text_process.split())
word_count_true_text = Counter(true_text_process.split())

In [None]:
plt.figure(figsize=(26,10))
sns.barplot(x=[i[0] for i in word_count_fake_text.most_common(50)],y=[i[1] for i in word_count_fake_text.most_common(50)])
plt.xticks(rotation=60)
for i,j in enumerate([i[1] for i in word_count_fake_text.most_common(50)]):
    plt.text(i-0.3,j+300,str(j))
plt.show()

In [None]:
plt.figure(figsize=(26,10))
sns.barplot(x=[i[0] for i in word_count_true_text.most_common(50)],y=[i[1] for i in word_count_true_text.most_common(50)])
plt.xticks(rotation=60)
for i,j in enumerate([i[1] for i in word_count_true_text.most_common(50)]):
    plt.text(i-0.3,j+300,str(j))
plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.barplot(x=['Fake',"Real","Fake - Real"],y=[len(set(word_count_fake_text.keys())),len(set(word_count_true_text.keys())),len(set(word_count_fake_text.keys()) - set(word_count_true_text.keys()))])
for i,j in enumerate([len(set(word_count_fake_text.keys())),len(set(word_count_true_text.keys())),len(set(word_count_fake_text.keys()) - set(word_count_true_text.keys()))]):
    plt.text(i-0.2,j+40,str(j))
plt.show()

In [None]:
true['Type'] = "True"
fake['Type'] = "Fake"

In [None]:
true_fake = pd.concat([true,fake])

In [None]:
true_fake.head()

In [None]:
true_fake.shape

In [None]:
true_fake.index = range(0,true_fake.shape[0])

In [None]:
true_fake.drop_duplicates(subset=['title','text']).shape

In [None]:
true_fake.drop_duplicates(subset=['title','text'],inplace=True)

In [None]:
true_fake.groupby(['subject']).Type.value_counts().plot.bar()
plt.show()

In [None]:
true_fake.subject.value_counts()

In [None]:
def final_proccessed(x):
    # Same cleaning as wc_proccessed, but returns a list of tokens
    # so it can be used as the analyzer of CountVectorizer
    text = x.strip().lower()
    text_str = "".join([i if i not in string.punctuation else " " for i in text]).strip()
    text = [i for i in text_str.split() if i.isalpha()]
    text_str = [i for i in text if i not in stopword]
    text = [i for i in text_str if len(i) > 1 and len(i) <= 45]
    return text

In [None]:
from wordcloud import wordcloud

In [None]:
# `fake` has no `Proccessed_text` column; reuse the cleaned fake-news text built earlier (fake_text_process)
wc = wordcloud.WordCloud(stopwords=stopword,width=1600,height=900,margin=1,max_words=200,prefer_horizontal=1,mode='RGB',relative_scaling=0.5,min_font_size=3).generate(fake_text_process)

In [None]:
plt.figure(figsize=(14,8))
plt.imshow(wc)
plt.axis(False)
plt.show()

In [None]:
true_fake['title_text'] = true_fake.title + " " + true_fake.text

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(true_fake.title_text,true_fake.Type, test_size=0.1, random_state=42)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer

In [None]:
cv = CountVectorizer(analyzer=final_proccessed).fit(X_train)
cv_trans = cv.transform(X_train)

In [None]:
cv_trans.shape

In [None]:
tfidf = TfidfTransformer().fit(cv_trans)
tfidf_trans = tfidf.transform(cv_trans)

In [None]:
tfidf_trans.shape

In [None]:
cv_trans_test = cv.transform(X_test)
tfidf_trans_test = tfidf.transform(cv_trans_test)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

In [None]:
models = [LogisticRegression(random_state=42),LinearSVC(random_state=42),KNeighborsClassifier(),RandomForestClassifier(random_state=42),DecisionTreeClassifier(random_state=42)]

In [None]:
# Try increasing iteration limits; the loop variable must be passed as max_iter
for i in range(100,1001,100):
    print("*************{}***************".format(i))
    lr = LogisticRegression(max_iter=i,C=3.5)
    lr.fit(tfidf_trans,y_train)
    print(accuracy_score(y_test,lr.predict(tfidf_trans_test)))

In [None]:
for i in range(2,101,1):
    print("*************{}***************".format(i))
    lr= DecisionTreeClassifier(random_state=42,max_depth=22,min_samples_split=i)
    lr.fit(tfidf_trans,y_train)
    print(accuracy_score(y_test,lr.predict(tfidf_trans_test)))

In [None]:
lr= RandomForestClassifier(random_state=42)
lr.fit(tfidf_trans,y_train)
print(accuracy_score(y_test,lr.predict(tfidf_trans_test)))

In [None]:
for i in models:
    print(i)

In [None]:
models_score = []
for i in models:
    i.fit(tfidf_trans,y_train)
    models_score.append(accuracy_score(y_test,i.predict(tfidf_trans_test)))

In [None]:
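plt.figure(figsize=(14,7))
# Use class names as x labels; estimator objects are not valid category labels
model_names = [type(m).__name__ for m in models]
sns.barplot(y=models_score,x=model_names)
for i,j in enumerate(models_score):
    plt.text(i-0.3,j+0.02,str(j))
plt.xticks(rotation=90)
plt.show()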