**Background**

Most currently available fake news datasets revolve around US politics, entertainment news, or satire. They are typically scraped from fact-checking websites, where the articles are labeled by human experts.
This dataset is built around the Syrian war. It is motivated by the specific nature of news reporting on incidents of war and by the lack of available sources from which manually labeled news articles can be scraped.

**About the dataset**

The dataset consists of news articles from several media outlets representing mobilisation press, loyalist press, and diverse print media. Each article is labeled 0 (fake) or 1 (credible).
It contains 804 labeled articles, which makes it well suited to training machine learning models that predict the credibility of news articles.

The credibility of each article is computed with respect to ground-truth information obtained from the Syrian Violations Documentation Center (VDC). The dataset was collected by researchers at the American University of Beirut (AUB).

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
import itertools
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

**Loading the data**

In [None]:
df=pd.read_csv('/kaggle/input/a-fake-news-dataset-around-the-syrian-war/FA-KES-Dataset.csv',encoding='latin1')
df.head()

As we can see, the dataset contains the article title, article content, media source, date of the incident, where the incident happened, and the labels (real or fake).

**Preliminary text exploration**


Before we proceed with any text pre-processing, it is advisable to quickly explore the dataset in terms of word counts, most common and most uncommon words.

Count NaN or missing values in DataFrame

In [None]:
df.isnull().sum().sum()
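
The single total above does not say which columns are affected. As a minimal sketch, assuming a small toy frame whose column names are only illustrative, the per-column counts can be inspected like this:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "article_title": ["a", None, "c"],
    "article_content": ["x", "y", "z"],
    "labels": [1, 0, np.nan],
})

# Missing values per column, rather than one grand total.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Summing the per-column counts gives the same number as toy.isnull().sum().sum().
total_missing = int(missing_per_column.sum())
print(total_missing)
```

On the real data the same call would be `df.isnull().sum()`.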

In [None]:
print('There are {} rows and {} columns in the dataset'.format(df.shape[0],df.shape[1]))

In [None]:
print(df.article_content.describe())

We have duplicated rows in our dataset

Find Duplicate Rows based on all columns

In [None]:
ddf = df[df.duplicated()]
print(ddf)

Duplicated rows might affect our results, so we should remove them.

In [None]:
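# Note: keep=False drops every copy of a duplicated row, including the first occurrence.
# Use keep='first' instead to retain one copy of each duplicated row.
df.drop_duplicates(keep=False, inplace=True)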

In [None]:
ddf = df[df.duplicated()]
print(ddf)
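
The `keep` argument of `drop_duplicates` decides how many copies survive. The sketch below uses a made-up three-row frame (not the real data) to compare `keep=False`, as used in the cell above, with `keep='first'`:

```python
import pandas as pd

# Toy frame with one duplicated row (values are made up).
toy = pd.DataFrame({"text": ["a", "a", "b"], "labels": [1, 1, 0]})

# keep=False removes every row that has a duplicate, including the first copy.
dropped_all = toy.drop_duplicates(keep=False)

# keep="first" (the pandas default) retains one copy of each duplicated row.
kept_first = toy.drop_duplicates(keep="first")

print(len(dropped_all), len(kept_first))
```

Which behaviour is preferable depends on whether the duplicated articles themselves should stay in the training data.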

Now we can move forward in our task!

It's best to start by understanding how the dataset is distributed across the labels (0/1).

In [None]:
#Show Labels distribution

df['labels'].value_counts(normalize=True)


Our dataset is slightly imbalanced towards real news (1).

In [None]:
sns.countplot(x='labels', data=df)
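
With imbalanced labels, accuracy is easier to interpret against a majority-class baseline, that is, the score obtained by always predicting the most frequent label. A minimal sketch on a made-up label series (on the real data this would be `df['labels']`):

```python
import pandas as pd

# Hypothetical labels for illustration: three real (1) and two fake (0).
toy_labels = pd.Series([1, 1, 1, 0, 0])

# Share of each class in the data.
class_share = toy_labels.value_counts(normalize=True)

# Accuracy of a model that always predicts the most frequent class.
majority_baseline = class_share.max()
print(majority_baseline)
```

A classifier is only informative if it clearly beats this number.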

**Exploratory Data Analysis of News**

In [None]:
df['source'].value_counts().plot(kind='barh')

The bar chart shows the number of articles from each news source.

In [None]:

# Pass figsize to plot() directly; calling plt.figure() afterwards would only open a new, empty figure.
df.groupby(['source','labels']).size().unstack().plot(kind='bar', stacked=False, figsize=(20,10))
plt.show()

This shows how much each source contributes to real and fake news.
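
Raw counts make sources of different sizes hard to compare. One option, sketched here on a made-up frame with hypothetical source names `A` and `B`, is a row-normalised `pd.crosstab`, which gives the share of fake and real articles within each source:

```python
import pandas as pd

# Toy data: source A has 1 real and 2 fake articles, source B has 2 real ones.
toy = pd.DataFrame({
    "source": ["A", "A", "A", "B", "B"],
    "labels": [1, 0, 0, 1, 1],
})

# normalize="index" divides each row by its total, so every row sums to 1.
share = pd.crosstab(toy["source"], toy["labels"], normalize="index")
print(share)
```

On the real data the call would be `pd.crosstab(df['source'], df['labels'], normalize='index')`.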

We will do a very basic analysis at the character and word level.

In [None]:
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))
true_len=df[df['labels']==1]['article_content'].str.len()
ax1.hist(true_len,color='green')
ax1.set_title('Real News')
fake_len=df[df['labels']==0]['article_content'].str.len()
ax2.hist(fake_len,color='red')
ax2.set_title('Fake News')
fig.suptitle('Characters in an article')
plt.show()

Number of words in an article

In [None]:
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))
true_len=df[df['labels']==1]['article_content'].str.split().map(lambda x: len(x))
ax1.hist(true_len,color='green')
ax1.set_title('Real News')
fake_len=df[df['labels']==0]['article_content'].str.split().map(lambda x: len(x))
ax2.hist(fake_len,color='red')
ax2.set_title('Fake News')
fig.suptitle('Words in an article')
plt.show()

Average word length in an article

In [None]:
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))
# sns.distplot is deprecated; sns.histplot with kde=True (seaborn >= 0.11) draws a comparable plot
word=df[df['labels']==1]['article_content'].str.split().apply(lambda x : [len(i) for i in x])
sns.histplot(word.map(lambda x: np.mean(x)),ax=ax1,color='green',kde=True,stat='density')
ax1.set_title('Real')
word=df[df['labels']==0]['article_content'].str.split().apply(lambda x : [len(i) for i in x])
sns.histplot(word.map(lambda x: np.mean(x)),ax=ax2,color='red',kde=True,stat='density')
ax2.set_title('Fake')
fig.suptitle('Average word length in each article')
plt.show()
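
The three histograms above can also be summarised numerically. The following is a sketch on a made-up three-row frame, where the helper column names `n_chars`, `n_words` and `avg_word_len` are introduced only for illustration:

```python
import pandas as pd

# Toy articles (made up); the real columns are df['article_content'] and df['labels'].
toy = pd.DataFrame({
    "article_content": ["one two three", "four five", "six"],
    "labels": [1, 1, 0],
})

text = toy["article_content"]
stats = pd.DataFrame({
    "labels": toy["labels"],
    "n_chars": text.str.len(),               # characters per article
    "n_words": text.str.split().str.len(),   # whitespace-separated words per article
    # mean length of the words in each article
    "avg_word_len": text.str.split().map(lambda ws: sum(len(w) for w in ws) / len(ws)),
})

# Mean of each statistic per label.
summary = stats.groupby("labels").mean()
print(summary)
```

A table like this makes it easier to check whether the real and fake distributions actually differ.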

**The Most common words in Real news**

In [None]:
mfreq = pd.Series(' '.join(df[df['labels']==1]['article_content']).split()).value_counts()[:25]
mfreq

**Data Exploration**

We will now visualize the text to get insights into the most frequently used words.

We will use TfidfVectorizer for some text pre-processing, such as removing stop words, and to get the vocabulary of our articles.

In [None]:
vect = TfidfVectorizer(use_idf=True,max_df=0.40,min_df=0.1,stop_words='english').fit(df[df['labels']==1]['article_content'])
# get_feature_names() was removed in scikit-learn 1.2; the fitted vocabulary_ has the same size
len(vect.vocabulary_)

In [None]:
list(vect.vocabulary_.keys())[:10]
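
The `min_df` and `max_df` arguments control which terms survive: given as floats, they are fractions of the number of documents, and terms that appear in too few or too many documents are dropped. The sketch below uses a made-up five-document corpus and thresholds chosen for this toy example (0.3 and 0.5, not the 0.1 and 0.4 used above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up). Document frequencies:
# war=5, report=2, city=2, army=2, town=1, rebels=1
docs = [
    "war report city",
    "war report town",
    "war army city",
    "war army",
    "war rebels",
]

# With 5 documents, min_df=0.3 requires a term in at least 1.5 documents,
# and max_df=0.5 allows a term in at most 2.5 documents.
vect_toy = TfidfVectorizer(min_df=0.3, max_df=0.5).fit(docs)

kept_terms = sorted(vect_toy.vocabulary_)
print(kept_terms)
```

Here `war` is dropped for being too common, while `town` and `rebels` are dropped for being too rare.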

Wordcloud for words in real news after some cleaning and deleting stop words using TfidfVectorizer 

In [None]:
true_tfidf=list(vect.vocabulary_.keys())
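# Join the terms with spaces; str() on a list would also pass brackets and quotes to WordCloud
wordcloud = WordCloud(width=1600, height=800).generate(' '.join(true_tfidf))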
#  plot word cloud image.

plt.figure( figsize=(20,10), facecolor='k')
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

And let's see the most common words in fake news

In [None]:
mfreq = pd.Series(' '.join(df[df['labels']==0]['article_content']).split()).value_counts()[:25]
mfreq

In [None]:
vect = TfidfVectorizer(use_idf=True,max_df=0.40,min_df=0.1,stop_words='english').fit(df[df['labels']==0]['article_content'])
# get_feature_names() was removed in scikit-learn 1.2; the fitted vocabulary_ has the same size
len(vect.vocabulary_)

Wordcloud for words in fake news after some cleaning and deleting stop words using TfidfVectorizer 

In [None]:
fake_tfidf=list(vect.vocabulary_.keys())
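# Join the terms with spaces; str() on a list would also pass brackets and quotes to WordCloud
wordcloud = WordCloud(width=1600, height=800).generate(' '.join(fake_tfidf))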
#  plot word cloud image.

plt.figure( figsize=(20,10), facecolor='k')
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

Let's look at the word cloud for all articles (real and fake) from article_content.

In [None]:
#Initialize TfidfVectorizer
tfidf_vect=TfidfVectorizer(stop_words='english',max_df=0.4,min_df=0.1).fit(df['article_content'])
# get_feature_names() was removed in scikit-learn 1.2; the fitted vocabulary_ has the same size
len(tfidf_vect.vocabulary_)

In [None]:
txt_tfidf=list(tfidf_vect.vocabulary_.keys())
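# Join the terms with spaces; str() on a list would also pass brackets and quotes to WordCloud
wordcloud = WordCloud(width=1600, height=800).generate(' '.join(txt_tfidf))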
#  plot word cloud image.

plt.figure( figsize=(20,10), facecolor='k')
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

**Classifier: Features and Design**

* To train supervised classifiers, we first transform the “article_content” into a vector of numbers. We explore vector representations such as TF-IDF weighted vectors.

* With this vector representation of the text, we can train supervised classifiers and then predict the “labels” (0/1) of unseen “article_content”.
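
To make the TF-IDF weighting concrete, here is a hand-computed sketch in plain Python. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalisation that scikit-learn's `TfidfVectorizer` applies by default, with raw term counts as TF (the notebook's own vectorizer additionally sets `sublinear_tf=True`, which this sketch leaves out). The toy documents are made up.

```python
import math
from collections import Counter

# Three tiny tokenised documents (made up).
docs = [["army", "city"], ["army", "town"], ["army", "army", "rebels"]]
n_docs = len(docs)

# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    # Raw count times IDF, then scale the vector to unit L2 length.
    counts = Counter(doc)
    raw = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[0])
print(weights)
```

In the first document, `city` gets a higher weight than `army` because `army` occurs in every document and is therefore less distinctive.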

Now that we have the features and labels, it is time to train the classifiers. There are a number of algorithms we can use for this type of problem.

First, we build the TF-IDF features and the labels from the dataset:

In [None]:
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, encoding='latin-1', ngram_range=(1, 2), stop_words='english')

features = tfidf.fit_transform(df.article_content).toarray()
labels = df.labels
features.shape

Naive Bayes Classifier: the one most suitable for word counts is the multinomial variant:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df['article_content'], df['labels'], random_state = 0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

clf = MultinomialNB().fit(X_train_tfidf, y_train)

Let's try predicting the label of a recent news item.

In [None]:
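# Apply the same count -> TF-IDF transformation used in training before predicting
print(clf.predict(tfidf_transformer.transform(count_vect.transform(["The Syrian army has taken control of a strategic northwestern crossroads town, its latest gain in a weeks-long offensive against the country's last major rebel bastion."]))))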

Awesome! That's correct.
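
Chaining the vectorizer and the classifier by hand makes it easy to skip a step at prediction time. One way to avoid that, sketched here with made-up toy texts rather than the real articles, is scikit-learn's `make_pipeline`, which applies the same transformations in `fit` and `predict`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy texts and labels, for illustration only.
texts = ["red apple fruit", "green apple fruit", "fast car engine", "slow car engine"]
toy_labels = [1, 1, 0, 0]

# TfidfVectorizer combines CountVectorizer and TfidfTransformer in one step.
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(texts, toy_labels)

# Raw strings go straight in; the pipeline vectorizes them before predicting.
pred = pipe.predict(["apple fruit", "car engine"])
print(pred)
```

On the real data, the same pipeline could be fitted with `pipe.fit(X_train, y_train)` and evaluated on `X_test`.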

**Model Selection**


We are now ready to experiment with different machine learning models, evaluate their accuracy and find the source of any potential issues.

We will benchmark the following four models:

* Logistic Regression 
* (Multinomial) Naive Bayes 
* Linear Support Vector Machine 
* Random Forest

In [None]:
models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0)]
CV = 5
cv_df = pd.DataFrame(index=range(CV * len(models)))
entries = []
for model in models:
  model_name = model.__class__.__name__
  accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
  for fold_idx, accuracy in enumerate(accuracies):
    entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])


sns.boxplot(x='model_name', y='accuracy', data=cv_df)
sns.stripplot(x='model_name', y='accuracy', data=cv_df, 
              size=8, jitter=True, edgecolor="gray", linewidth=2)
plt.show()

In [None]:
cv_df.groupby('model_name').accuracy.mean()
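
The benchmark above computes the TF-IDF features on the full dataset before cross-validating, so the vocabulary and IDF statistics also see the validation folds. A possible alternative, sketched on made-up toy texts with only two folds, is to put the vectorizer inside a pipeline so that it is refitted on each training fold, and to report the spread along with the mean:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up toy texts and labels, for illustration only.
texts = [
    "red apple fruit", "green apple fruit", "sweet apple fruit", "ripe apple fruit",
    "fast car engine", "slow car engine", "loud car engine", "old car engine",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is part of the model, so each fold fits it on training data only.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(random_state=0))

# Stratified folds keep the label ratio the same in every split.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
scores = cross_val_score(pipe, texts, toy_labels, scoring="accuracy", cv=cv)

print(scores.mean(), scores.std())
```

On the real data, `texts` and `toy_labels` would be replaced by `df['article_content']` and `df['labels']`, with `n_splits=5` to match the benchmark.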

**Conclusion**


The accuracy of these models is low. In this case, I think it's better to collect more data rather than try another model to get better accuracy.