# Detecting Fake News with Logistic Regression

## 📰 Introduction
In this assignment, you will analyze a dataset of news articles labeled as either 'real' or 'fake'.
You will explore the data, engineer features, build a logistic regression model, and evaluate the model's performance.

Note: This assignment is based on Chapter 21 of the "Learning Data Science" textbook.
Link: https://learningds.org/ch/21/fake_news_intro.html

## 📦 Step 1: Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import FunctionTransformer

## 📁 Step 2: Load and Understand the Data

Run the cell below to:
- Load the dataset from `fake_news.csv`.
- Display the first five rows to get an overview.

In [None]:
df = pd.read_csv('fake_news.csv')
df.head()

The dataset comprises news articles labeled as either "REAL" or "FAKE". Each entry includes metadata and content of the article. More specifically, the following variables are included:
- `timestamp`: The date and time the article was published or collected.
- `baseurl`: The domain or website where the article was published.
- `content`: The full text of the news article.
- `label`: Indicates whether the article is real or fake. Values are "REAL" and "FAKE".

### Task 1: missing values
Check for data types and missing values.
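
As a rough illustration of what such a check can look like, the sketch below uses a small made-up DataFrame (`toy`, which is an assumption and not the assignment data). It shows `dtypes` for the column types and `isna().sum()` for the number of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the assignment dataset.
toy = pd.DataFrame({
    'baseurl': ['a.com', 'b.org', None],
    'content': ['some text', np.nan, 'more text'],
    'label': ['real', 'fake', 'fake'],
})

# Data type of each column (text columns usually appear as `object`).
print(toy.dtypes)

# Number of missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

`toy.info()` prints a similar overview (types and non-null counts) in a single call.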

In [None]:
# your code here
...

*Interpretation*: Your interpretation here

### Task 2: bias
(see Section 21.1 of "Learning Data Science")

This dataset is a simplified version of the FakeNewsNet data repository described in [Shu et al.](https://arxiv.org/abs/1809.01286). This repository contains content from news and social media websites, as well as metadata like user engagement metrics. For simplicity, we only look at the dataset’s political news articles. This subset of the data includes only articles that were fact-checked by Politifact, a nonpartisan organization with a good reputation. Each article in the dataset has a “real” or “fake” label based on Politifact’s evaluation, which we use as the ground truth.

Politifact uses a nonrandom sampling method to select articles to fact-check. According to its website, Politifact’s journalists select the “most newsworthy and significant” claims each day. Politifact started in 2007 and the repository was published in 2020, so most of the articles were published between 2007 and 2020.

Summarizing this information, we determine that the target population consists of all political news stories published online in the time period from 2007 to 2020 (we would also want to list the sources of the stories). The access frame is determined by Politifact’s identification of the most newsworthy claims of the day.

Based on the dataset's structure and content, discuss potential sources of bias that could affect model performance. Consider aspects such as:
- The origin of the articles (e.g., specific publishers or websites).
- The time frame during which the articles were published.
- The topics covered and their distribution across real and fake news.
- Any preprocessing steps already applied to the dataset.

*Your answer here*

## 📊 Step 3: Exploratory Data Analysis (EDA)

### Splitting the Data into Training and Testing Sets

Before diving into EDA, we split our dataset into training (75%) and testing (25%) sets. For this purpose, we use the function [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from the scikit-learn package (click on the function name to access the documentation). Additionally, we encode our `label` column, using `1` for the `fake` label and `0` for the `real` label.

In [None]:
from sklearn.model_selection import train_test_split

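# Encode the target: 1 = fake, 0 = real.
# The comparison is case-insensitive, so 'FAKE' and 'fake' both map to 1.
df['label'] = (df['label'].str.lower() == 'fake').astype(int)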

X_train, X_test, y_train, y_test = train_test_split(
    df[['timestamp', 'baseurl', 'content']], df['label'],
    test_size=0.25, random_state=42,
)
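
After a split like the one above, it can be worth checking that the set sizes and the encoded labels look as expected. The following is only a sketch on a hypothetical eight-row frame (`toy`, `X_tr`, `X_te`, `y_tr`, and `y_te` are made-up names, not part of the assignment).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data, NOT the assignment dataset.
toy = pd.DataFrame({
    'content': [f'article {i}' for i in range(8)],
    'label': ['fake', 'real', 'real', 'fake', 'real', 'fake', 'real', 'real'],
})

# Encode the target as 1 (fake) and 0 (real).
toy['label'] = (toy['label'] == 'fake').astype(int)

# A 75% / 25% split with a fixed seed for reproducibility.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy[['content']], toy['label'], test_size=0.25, random_state=42
)

# Sanity checks: sizes of both sets and the label values that occur.
print(len(X_tr), len(X_te))
print(sorted(toy['label'].unique()))
```

If the encoded labels contained only zeros, that would suggest the string comparison did not match the spelling of the labels in the file.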

### Task 3: Exploring the distribution of real vs fake
- Count the number of real vs. fake articles in the training data. (Hint: Use [`value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html))
- Compute the proportion of fake articles and store it in a variable named `fake_proportion` (a plotting cell in Task 6 uses this variable).
- Plot the class distribution. (Hint: Use [`sns.histplot()`](https://seaborn.pydata.org/generated/seaborn.histplot.html) or [`plt.hist()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html))
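
For orientation, here is a small sketch of how `value_counts()` behaves on a 0/1 Series. The Series `y_toy` is made up for illustration and is not the training data.

```python
import pandas as pd

# Hypothetical 0/1 labels, NOT the assignment data.
y_toy = pd.Series([1, 0, 0, 1, 0], name='label')

# Absolute counts per class.
counts = y_toy.value_counts()

# Relative frequencies per class.
shares = y_toy.value_counts(normalize=True)

# For 0/1 labels, the mean equals the share of ones.
prop_ones = y_toy.mean()

print(counts)
print(shares)
print(prop_ones)
```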

In [None]:
# your code here

### Task 4: Exploring the publishers
Analyze the `baseurl` column to inspect article sources:

a) Count the number of articles per source. *(Hint: use `value_counts()`; [pandas docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html))*

b) Visualize the top 10 sources using a horizontal bar chart with the number of articles on the x-axis. (Hint: [`pd.Series.plot(kind='barh')`](https://pandas.pydata.org/docs/reference/api/pandas.Series.plot.html))

c) List the top 10 sources of fake news (`label == 1`).

d) List the top 10 sources of real news (`label == 0`).
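
Because the features and the labels live in separate objects after the split, filtering by label needs a boolean mask built from the label Series. The sketch below shows the idea on made-up data (`X_toy` and `y_toy` are assumptions); it relies on both objects sharing the same index.

```python
import pandas as pd

# Hypothetical toy data with a shared, non-default index.
X_toy = pd.DataFrame(
    {'baseurl': ['a.com', 'b.org', 'a.com', 'c.net', 'a.com', 'b.org']},
    index=[10, 11, 12, 13, 14, 15],
)
y_toy = pd.Series([1, 0, 1, 1, 0, 0], index=X_toy.index)

# Number of rows per source, sorted from most to least frequent.
per_source = X_toy['baseurl'].value_counts()

# Restrict to rows whose label is 1, then count and keep the top entries.
top_label1 = X_toy.loc[y_toy == 1, 'baseurl'].value_counts().head(10)

print(per_source)
print(top_label1)
```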

In [None]:
# your code for a) here

In [None]:
# your code for b) here

In [None]:
# your code for c) here

In [None]:
# your code for d) here

*Your observations here*

### Task 5: Exploring words
In this task, we want to explore whether there is a connection between the language used in the articles and whether they were identified as fake. A straightforward approach is to focus on specific words, such as *military*, and to compute the share of articles containing that word that were labeled as fake. For a word like *military* to be considered informative, the percentage of fake articles among those that mention it should be substantially higher or lower than 45%, which is the overall proportion of fake articles in the dataset (264 out of 584).
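
The comparison described above can be written out as a short calculation. The baseline uses the counts quoted in the text; the counts for the example word are invented purely for illustration.

```python
# Baseline from the text: 264 fake articles out of 584.
n_fake, n_total = 264, 584
baseline = n_fake / n_total
print(f'baseline share of fake articles: {baseline:.3f}')

# Hypothetical word (made-up numbers): it appears in 50 articles, 40 of them fake.
n_with_word, n_fake_with_word = 50, 40
share_with_word = n_fake_with_word / n_with_word

# A large gap to the baseline suggests the word carries information.
print(f'share among articles with the word: {share_with_word:.3f}')
print(f'difference to baseline: {share_with_word - baseline:+.3f}')
```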

Define a function `make_word_features()` with the following signature:
- arguments:
    - DataFrame `df` which needs to have a column called `content`
    - list `words`
- output: DataFrame with the same number of rows as `df` and one column per word in the input list `words`. In each row, the column for a word contains `True` if that word occurs in `content` and `False` otherwise.


*Hints:* 
- You can create a DataFrame from a dictionary (see [here](https://www.geeksforgeeks.org/how-to-create-dataframe-from-dictionary-in-python-pandas/))
- To check whether a certain word is included in a string, use the method [`str.contains()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html) (click to open the documentation).
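
To illustrate the two hints, the sketch below applies `str.contains()` to a made-up Series (`texts` is an assumption) and collects the results in a DataFrame built from a dictionary. Note that the match is case-sensitive by default and works on substrings, so `'vote'` also matches inside longer words.

```python
import pandas as pd

# Hypothetical example texts, NOT the assignment data.
texts = pd.Series(['The vote was close', 'No mention here', 'Voters went to vote'])

# Boolean Series: True where the substring occurs (case-sensitive by default).
has_vote = texts.str.contains('vote')

# 'the' does not match 'The' unless case is ignored.
has_the = texts.str.contains('the')
has_the_any_case = texts.str.contains('the', case=False)

# A dictionary of Series becomes a DataFrame with one column per key.
flags = pd.DataFrame({'vote': has_vote, 'the': has_the})
print(flags)
```

This is one reason why lowercasing the text before creating word features can matter.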

In [None]:
# your code here

Next, we test your new function with some words.

### Task 6: Interpreting word predictors
In the cell below, we test your new function with some test words. The subsequent cell creates a graph to visualize the results. Carefully look at the test words and at the result of this analysis. What conclusions can we draw from this regarding our modeling task?

In [None]:
# run this cell
word_features = [
    'trump', 'clinton', # names of presidents
    'state', 'vote', 'congress', 'shutdown', # congress words
    'military', 'princ', 'investig', 'antifa', 'joke', 'homeless',
    'swamp', 'cnn', 'the' #other possibly useful words
]
df_words = make_word_features(df, word_features)
df_words

In [None]:
# run this cell
fake_props = (make_word_features(X_train, word_features)
 .assign(label=(y_train == 1))
 .melt(id_vars=['label'], var_name='word', value_name='appeared')
 .query('appeared == True')
 .groupby('word')
 ['label']
 .agg(['mean', 'count'])
 .rename(columns={'mean': 'prop_fake'})
 .sort_values('prop_fake', ascending=False)
 .reset_index()
 .melt(id_vars='word')
)

g = sns.catplot(
    data=fake_props,
    x='value',
    y='word',
    col='variable',
    hue='variable',        # Color-code by the type of metric (prop_fake or count)
    s=15,                  # Increase dot size
    jitter=False,
    sharex=False,
    height=3,
    aspect=1.3
)


[[prop_ax, _]] = g.axes
prop_ax.axvline(fake_proportion, linestyle='--')
prop_ax.set(xlim=(-0.05, 1.05))

titles = ['Proportion of articles marked fake', 'Number of articles with word']

for ax, title in zip(g.axes.flat, titles):
    # Set a different title for each axes
    ax.set(title=title)
    ax.set(xlabel=None)
    ax.set(ylabel=None)
    ax.yaxis.grid(True);

*Your observations here*

## 🧠 Step 4: Modeling

### Task 7: Building our first logistic regression model
Our EDA showed that the word *vote* is related to whether an article is labeled real or fake. To test this, we fit a logistic regression model using a single binary feature: `1` if the word *vote* appears in the article and `0` if not.

To do so, follow the steps below (see slide 7 of the small slide deck for today's lecture):

a) Write a function `lowercase` that takes a dataframe `df` as argument and returns a copy of it in which the `content` column is converted to lowercase. (Hint: apply the [.str.lower()](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.lower.html) method)

b) Create a pipeline using [make_pipeline()](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html) which includes the following elements:
- Preprocessing step 1: your `lowercase` function from step a) (wrapped into the pipeline using [FunctionTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html))
- Preprocessing step 2: your `make_word_features` function from Task 5, using *vote* as the only word to check for (also wrapped in a `FunctionTransformer`; its `kw_args` parameter lets you pass the `words` argument).
- Modeling step: `LogisticRegression(penalty=None)` (scikit-learn versions before 1.2 expect `penalty='none'` instead)

c) Train the model on the training data (`fit()`)

d) Evaluate the accuracy score
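
The general pattern of steps a) to d) can be sketched on an unrelated toy problem. Everything in the block below is hypothetical: the data, the helper functions `strip_text` and `count_chars`, and the feature (number of exclamation marks). It also uses scikit-learn's default, L2-regularized `LogisticRegression()` rather than the unpenalized model requested in this task, and it scores on the same toy data it was fitted on.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def strip_text(frame):
    """Hypothetical preprocessing step: trim whitespace in the `text` column."""
    return frame.assign(text=frame['text'].str.strip())

def count_chars(frame, chars):
    """Hypothetical feature step: one integer column per character in `chars`."""
    return pd.DataFrame(
        {c: frame['text'].str.count(c) for c in chars}, index=frame.index
    )

# Made-up toy data, NOT the assignment dataset.
X_toy = pd.DataFrame({'text': [
    ' wow!!! ', 'calm report', 'shocking!', 'daily news ',
    'unbelievable!!', 'budget passed', 'quiet day!', 'must see',
]})
y_toy = pd.Series([1, 0, 1, 0, 1, 0, 0, 1])

# kw_args forwards extra keyword arguments to the wrapped function.
pipe = make_pipeline(
    FunctionTransformer(strip_text),
    FunctionTransformer(count_chars, kw_args={'chars': ['!']}),
    LogisticRegression(),  # default L2 penalty, an assumption of this sketch
)

pipe.fit(X_toy, y_toy)            # runs both transformers, then fits the model
acc = pipe.score(X_toy, y_toy)    # mean accuracy on the data passed in
print(acc)
```

Calling `fit()` or `score()` on the pipeline applies every preprocessing step to the raw input first, so the same object can be used directly on unprocessed data.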

In [None]:
# Your code for a) here

In [None]:
# Your code for b) here
model1 = ...

In [None]:
# Your code for c) here

In [None]:
# Your code for d) here

Run the cell below to see the confusion matrix:

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(
    model1, X_test, y_test, cmap=plt.cm.Blues, colorbar=False
)
plt.grid(False);
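
As a reading aid for the plot, the sketch below computes a confusion matrix for four made-up predictions (`y_true_toy` and `y_pred_toy` are assumptions). In scikit-learn's convention, rows correspond to the true class and columns to the predicted class.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels, NOT model output from the assignment.
y_true_toy = [0, 0, 1, 1]
y_pred_toy = [0, 1, 1, 1]

# Entry [i, j] counts samples with true class i that were predicted as class j.
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# Accuracy is the sum of the diagonal divided by the number of samples.
acc = accuracy_score(y_true_toy, y_pred_toy)
print(acc)
```

Here one real article (class 0) is wrongly flagged as fake, and both fake articles are classified correctly, so three of the four predictions are right.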

### Task 8: Interpreting our first logistic regression model
Extract the coefficients of your logistic regression model from Task 7. Use them to compute the probability that your model assigns to an article containing the word *vote* being fake. Do the same for an article that does not contain the word *vote*.

(Hint: You can access the logistic regression model from the pipeline via `model1.steps[2][1]`: the logistic regression is the third step in the pipeline, which has the Python index 2, and each step is a tuple whose second entry, at index 1, is the model itself. From this object you get the coefficients as `coef_` and the intercept as `intercept_`. For example, the *vote* coefficient is `model1.steps[2][1].coef_`.)
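
For a single binary feature x, logistic regression models the probability of class 1 as 1 / (1 + exp(-(b0 + b1 * x))), where b0 is the intercept and b1 the coefficient. The sketch below checks this on a made-up dataset with a standalone, default (L2-regularized) `LogisticRegression`; the names `X_toy`, `y_toy`, `clf`, and `sigmoid` are assumptions and not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one binary feature and a 0/1 target.
X_toy = np.array([[0], [0], [0], [0], [1], [1], [1], [1]])
y_toy = np.array([0, 0, 0, 1, 0, 1, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)

b0 = clf.intercept_[0]   # intercept (a one-element array for binary problems)
b1 = clf.coef_[0, 0]     # coefficient of the single feature

def sigmoid(z):
    """Logistic function mapping a real number to the interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

p_x1 = sigmoid(b0 + b1 * 1)   # probability of class 1 when the feature is 1
p_x0 = sigmoid(b0 + b1 * 0)   # probability of class 1 when the feature is 0

# The manual values should agree with predict_proba (second column = class 1).
print(p_x1, clf.predict_proba([[1]])[0, 1])
print(p_x0, clf.predict_proba([[0]])[0, 1])
```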

In [None]:
# your code here

### Task 9: Building a more complex model
Now create a model that uses all of the words we examined in our EDA of the training set, except for *the*. Again, compute the accuracy and display the confusion matrix.

In [None]:
# your code here
model2 = ...

In [None]:
# accuracy: your code here

In [None]:
# run this cell for the confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(
    model2, X_test, y_test, cmap=plt.cm.Blues, colorbar=False
)
plt.grid(False);