
# 0- Import Libraries

Data Manipulation:

pandas: Provides data structures like DataFrames, which are useful for handling and processing structured data.

In [1]:
import pandas as pd

Feature Extraction:

CountVectorizer: Converts a collection of text documents to a matrix of token counts.

TfidfVectorizer: Converts a collection of raw documents to a matrix of TF-IDF features, which reflect the importance of words in the documents.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

Model Training:

MultinomialNB: Implements the Multinomial Naive Bayes algorithm, which is suitable for classification with discrete features (like word counts for text classification).

In [3]:
from sklearn.naive_bayes import MultinomialNB

Model Evaluation:

accuracy_score: Calculates the accuracy of the model by comparing the predicted labels with the true labels.

In [4]:
from sklearn.metrics import accuracy_score

Install the chardet library to detect the file encoding programmatically.

In [5]:
%pip install chardet




# 1- Load Dataset

1a - Define the file path once

In [None]:
file_path = ""

1b- Load the dataset

In [7]:
data = pd.read_csv(file_path)

1c- Detect the encoding

In [8]:
import chardet

# Read the first few bytes of the file to detect the encoding
with open(file_path, 'rb') as file:
    raw_data = file.read(10000)
    result = chardet.detect(raw_data)
    encoding = result['encoding']
    print(f"The detected encoding is: {encoding}")

The detected encoding is: UTF-8-SIG
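
The `UTF-8-SIG` result suggests that the file starts with a byte order mark (BOM). As a minimal sketch using only the standard library, the snippet below shows what a BOM looks like and how the `utf-8-sig` codec strips it. The sample bytes and the helper name `has_utf8_bom` are made up for illustration and are not part of the project.

```python
import codecs


def has_utf8_bom(raw: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 byte order mark."""
    return raw.startswith(codecs.BOM_UTF8)


# Hypothetical first line of a file that was saved with a BOM.
sample = codecs.BOM_UTF8 + "0\tsome headline".encode("utf-8")

decoded_plain = sample.decode("utf-8")    # BOM survives as the character '\ufeff'
decoded_sig = sample.decode("utf-8-sig")  # BOM is stripped during decoding

print(has_utf8_bom(sample))        # True
print(repr(decoded_plain[0]))      # '\ufeff'
print(decoded_sig.split("\t")[0])  # 0
```

Without `utf-8-sig`, the invisible `'\ufeff'` character would stay attached to the first value of the first row.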


1d- Load the dataset with the correct encoding to handle BOM

In [9]:
data = pd.read_csv(file_path, encoding='utf-8-sig', header=None)

1e- Display the column names to verify they are correctly parsed

In [10]:
print(data.columns)

print("\n")

if list(data.columns) == [0]:
    print("This result means that the dataset currently has a single column named '0'. This might indicate that the columns were not correctly parsed or assigned during the data loading or processing steps.")
else:
    print(f"The dataset has the following columns: {list(data.columns)}")

Index([0], dtype='int64')


This result means that the dataset currently has a single column named '0'. This might indicate that the columns were not correctly parsed or assigned during the data loading or processing steps.


1f- Display the first few rows of the dataset

In [11]:
print(data.head())

                                                   0
0  0\tdonald trump sends out embarrassing new yea...
1  0\tdrunk bragging trump staffer started russia...
2  0\tsheriff david clarke becomes an internet jo...
3  0\ttrump is so obsessed he even has obama‚s na...
4  0\tpope francis just called out donald trump d...


**Conclusion:**

The outputs of steps 1e and 1f show that the dataset was not split into a label column and a text column: each row is a single string that begins with the label, followed by a tab character (for example '0\t'), and then the sentence. We therefore reload the dataset and split each row at the first tab, so that the part before the tab becomes the new label column and the rest becomes the text column.
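
Since the label and the text appear to be separated by a tab character, an alternative is to tell `pd.read_csv` about the separator when loading. The snippet below is only a sketch on made-up in-memory rows, not on the project file. One difference to keep in mind: with this approach pandas parses the label as an integer, whereas splitting the string afterwards keeps it as text.

```python
import io

import pandas as pd

# Hypothetical sample in the same "label<TAB>text" layout as the training file.
sample = "0\tfirst example headline\n1\tsecond example headline\n"

df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",                 # split on the tab instead of the default comma
    header=None,              # the file has no header row
    names=["label", "text"],  # column names used later in this notebook
)

print(df.shape)              # (2, 2)
print(df["label"].tolist())  # [0, 1]
```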

1g- Load the dataset with the correct encoding to handle BOM

In [12]:
data = pd.read_csv(file_path, encoding='utf-8-sig', header=None)

1h- Display the first few rows of the dataset

In [13]:
print(data.head())

                                                   0
0  0\tdonald trump sends out embarrassing new yea...
1  0\tdrunk bragging trump staffer started russia...
2  0\tsheriff david clarke becomes an internet jo...
3  0\ttrump is so obsessed he even has obama‚s na...
4  0\tpope francis just called out donald trump d...


1i- Split each row at the first tab character ('\t'): the part before the tab becomes the label column and the rest becomes the text column

In [14]:
data[['label', 'text']] = data[0].str.split('\t', n=1, expand=True)
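
To illustrate why `n=1` matters, here is a small sketch on made-up rows: limiting the split to the first tab keeps any later tab inside the text. It also shows that the label produced by the split is a string, and how it could be converted to an integer if that is preferred (the notebook itself keeps the string labels).

```python
import pandas as pd

# Hypothetical raw rows; the second one contains an extra tab inside the text.
raw = pd.DataFrame({0: ["0\tfirst headline", "1\tsecond\theadline with a tab"]})

parts = raw[0].str.split("\t", n=1, expand=True)  # split only at the first tab
parts.columns = ["label", "text"]

print(type(parts["label"].iloc[0]).__name__)  # str
parts["label"] = parts["label"].astype(int)   # optional conversion to integers
print(parts["label"].tolist())                # [0, 1]
```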

1j- Drop the original column

In [15]:
data = data.drop(columns=[0])

1k- Display the first few rows of the dataset after removing the first part

In [16]:
print(data.head())

print("\n")

if list(data.columns) == [0]:
    print("This result means that the dataset currently has a single column named '0'. This might indicate that the columns were not correctly parsed or assigned during the data loading or processing steps.")
else:
    print(f"The dataset has the following columns: {list(data.columns)}")

  label                                               text
0     0  donald trump sends out embarrassing new year‚s...
1     0  drunk bragging trump staffer started russian c...
2     0  sheriff david clarke becomes an internet joke ...
3     0  trump is so obsessed he even has obama‚s name ...
4     0  pope francis just called out donald trump duri...


The dataset has the following columns: ['label', 'text']


# 2- Preprocess Data

2a- Check for missing values in the dataset

In [17]:
missing_values = data.isnull().sum()
print(missing_values)

print("\n")

print(f"This result means that in the column named 'label', the amount of missing values is {missing_values['label']}, and in the column named 'text', the amount of missing values is {missing_values['text']}.")

label    0
text     0
dtype: int64


This result means that in the column named 'label', the amount of missing values is 0, and in the column named 'text', the amount of missing values is 0.
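
Note that `isnull()` only counts real missing values (`None`/`NaN`). A row whose text is an empty or whitespace-only string is not counted. The sketch below, on a made-up DataFrame, shows one possible extra check; whether the project data contains such rows is not known.

```python
import pandas as pd

# Hypothetical data: one missing label and one whitespace-only text.
df = pd.DataFrame(
    {
        "label": ["0", "1", None],
        "text": ["a headline", "   ", "another headline"],
    }
)

missing = df.isnull().sum()                      # counts None/NaN only
blank_text = df["text"].str.strip().eq("").sum() # counts empty or whitespace-only strings

print(int(missing["label"]), int(missing["text"]))  # 1 0
print(int(blank_text))                              # 1
```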


2b- Conditionally drop rows with missing values in the 'label' column

In [18]:
if missing_values['label'] > 0:
    data = data.dropna(subset=['label'])
    print("Rows with missing values in the 'label' column have been dropped.")
else:
    print("No missing values found in the 'label' column. No rows were dropped.")

No missing values found in the 'label' column. No rows were dropped.


2c- Check Data After Preprocessing

In [19]:
print(data.isnull().sum())
print(data.head())
print(data.shape)

print("\n")

print(f"This result means that the dataset has {data.shape[0]} rows and {data.shape[1]} columns. The two columns are 'label' and 'text'.")

label    0
text     0
dtype: int64
  label                                               text
0     0  donald trump sends out embarrassing new year‚s...
1     0  drunk bragging trump staffer started russian c...
2     0  sheriff david clarke becomes an internet joke ...
3     0  trump is so obsessed he even has obama‚s name ...
4     0  pope francis just called out donald trump duri...
(34152, 2)


This result means that the dataset has 34152 rows and 2 columns. The two columns are 'label' and 'text'.


# 3- Split Data

We split the dataset into training and testing sets. The training set is used to train the model, and the testing set is used to evaluate the model’s performance.

3a- Define the test size and random state as variables

In [20]:
test_size = 0.2
random_state = 42

3b- Split the data into training and testing sets

In [21]:
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['label'], test_size=test_size, random_state=random_state)

print("The dataset has been split into training and testing sets.")
print("X_train and y_train are the training data and labels, respectively.")
print("X_test and y_test are the testing data and labels, respectively.")
print(f"{test_size * 100}% of the data is used for testing, and {(1 - test_size) * 100}% is used for training.")
print(f"The random_state={random_state} ensures that the split is reproducible, meaning that every time you run the code with the same random_state value, you will get the same split of training and testing data. This is important for consistency and reliability in your results.")

The dataset has been split into training and testing sets.
X_train and y_train are the training data and labels, respectively.
X_test and y_test are the testing data and labels, respectively.
20.0% of the data is used for testing, and 80.0% is used for training.
The random_state=42 ensures that the split is reproducible, meaning that every time you run the code with the same random_state value, you will get the same split of training and testing data. This is important for consistency and reliability in your results.
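
As an optional variation, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both sets. The sketch below uses made-up texts and labels; the notebook's own split does not use stratification, so this is an assumption about what might be useful, not a description of the project.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: 15 samples of class "0" and 5 samples of class "1".
texts = [f"headline {i}" for i in range(20)]
labels = ["0"] * 15 + ["1"] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(X_tr), len(X_te))                # 16 4
print(y_te.count("0"), y_te.count("1"))    # 3 1

# Same random_state, same inputs: the split is identical on a second call.
X_tr2, X_te2, _, _ = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)
print(X_te == X_te2)                       # True
```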


# 4- Feature Extraction

We use two methods to convert the text data into numerical features: Count Vectorizer and TF-IDF Vectorizer.

4a- Method 1: Count Vectorizer

Converts text into a matrix of token counts.

In [22]:
count_vectorizer = CountVectorizer()
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)
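
To make the token-count matrix concrete, here is a small sketch on two made-up documents. It also shows that words not seen during `fit` are ignored by `transform`, which is relevant whenever the fitted vectorizer is applied to new data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus.
docs = ["trump sends message", "senate sends bill to senate"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Vocabulary terms ordered by their column index in the matrix.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)                      # ['bill', 'message', 'senate', 'sends', 'to', 'trump']
print(counts.toarray().tolist())  # [[0, 1, 0, 1, 0, 1], [1, 0, 2, 1, 1, 0]]

# "obama" was never seen during fit, so only "sends" is counted.
unseen = vectorizer.transform(["obama sends"]).toarray().tolist()
print(unseen)                     # [[0, 0, 0, 1, 0, 0]]
```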

4b- Method 2: TF-IDF Vectorizer

Converts text into a matrix of TF-IDF features, which reflect the importance of words in the documents.

In [23]:
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
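
As a sketch of how the "importance" weighting works, the snippet below recomputes the inverse document frequency by hand and compares it with the values stored by `TfidfVectorizer`. With the default settings (`smooth_idf=True`), scikit-learn uses `idf = ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The two documents and the helper name `smooth_idf` are made up for illustration.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: "sends" appears in both documents, "trump" in one.
docs = ["trump sends message", "senate sends bill"]

vectorizer = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
vectorizer.fit(docs)
n_docs = len(docs)


def smooth_idf(doc_freq: int) -> float:
    """Smoothed IDF as used by scikit-learn's default settings."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


idf_sends = float(vectorizer.idf_[vectorizer.vocabulary_["sends"]])
idf_trump = float(vectorizer.idf_[vectorizer.vocabulary_["trump"]])

print(round(idf_sends, 3), round(smooth_idf(2), 3))  # 1.0 1.0
print(round(idf_trump, 3), round(smooth_idf(1), 3))  # 1.405 1.405
```

A term that appears in every document gets the lowest weight, while a rarer term is weighted higher.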

# 5- Train and Evaluate Model

5a- Using Count Vectorizer features

In [24]:
model_count = MultinomialNB()
model_count.fit(X_train_count, y_train)
y_pred_count = model_count.predict(X_test_count)
accuracy_count = accuracy_score(y_test, y_pred_count)


5b- Using TF-IDF Vectorizer features

In [25]:
model_tfidf = MultinomialNB()
model_tfidf.fit(X_train_tfidf, y_train)
y_pred_tfidf = model_tfidf.predict(X_test_tfidf)
accuracy_tfidf = accuracy_score(y_test, y_pred_tfidf)
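
The vectorize, fit, and predict steps can also be bundled into a single scikit-learn pipeline, which guarantees that the same fitted vectorizer is used for training and for prediction. This is a sketch on made-up training texts, not a replacement for the cells above; the labels "0" and "1" are arbitrary here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data with string labels, as in this notebook.
train_texts = ["storm hits coast", "senate passes bill"]
train_labels = ["0", "1"]

pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(train_texts, train_labels)

# "warning" is not in the vocabulary and is ignored; "storm" and "coast"
# were only seen with label "0".
prediction = pipeline.predict(["storm warning coast"])
print(prediction[0])  # 0
```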

5c- Compare the accuracies

In [26]:
if accuracy_tfidf > accuracy_count:
    best_representation = "TF-IDF Vectorizer"
    best_accuracy = accuracy_tfidf
else:
    best_representation = "Count Vectorizer"
    best_accuracy = accuracy_count

print(f"The best feature representation is {best_representation} with an accuracy of {best_accuracy:.4f}.")

print("\n")

if best_representation == "Count Vectorizer":
    print("This means:\n\nCount Vectorizer: Using the word counts as features worked better than using TF-IDF scores.\nAccuracy: The model correctly identified whether the news is fake or real {:.2f}% of the time.".format(best_accuracy * 100))
else:
    print("This means:\n\nTF-IDF Vectorizer: Using the TF-IDF scores as features worked better than using word counts.\nAccuracy: The model correctly identified whether the news is fake or real {:.2f}% of the time.".format(best_accuracy * 100))

The best feature representation is Count Vectorizer with an accuracy of 0.9448.


This means:

Count Vectorizer: Using the word counts as features worked better than using TF-IDF scores.
Accuracy: The model correctly identified whether the news is fake or real 94.48% of the time.
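
Accuracy is a single number and does not show which class the errors fall into. As a possible complement, the sketch below computes a confusion matrix next to the accuracy. The true and predicted labels are made up and are not results from this project.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels.
y_true = ["0", "0", "1", "1", "1"]
y_pred = ["0", "1", "1", "1", "0"]

acc = accuracy_score(y_true, y_pred)
# Rows are the true labels, columns the predicted labels, in the order given.
cm = confusion_matrix(y_true, y_pred, labels=["0", "1"])

print(acc)          # 0.6
print(cm.tolist())  # [[1, 1], [1, 2]]
```

In the notebook, the same call could be made with `y_test` and `y_pred_count` (or `y_pred_tfidf`).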


# Integrating New Test Dataset Analysis

# 1- Load the New Test Dataset

1a - Define the file path once

In [117]:
new_file_path = "D:/OneDrive - Royal HaskoningDHV/920791/Pri 3/ironhack/Project-2-NLP/Project-2-NLP/dataset/test_data_news_only.csv"

1b- Load the dataset

In [118]:
new_data = pd.read_csv(new_file_path)

1c- Detect the encoding

In [119]:
import chardet

# Read the first few bytes of the file to detect the encoding
with open(new_file_path, 'rb') as file:
    raw_data = file.read(10000)
    result = chardet.detect(raw_data)
    encoding = result['encoding']
    print(f"The detected encoding is: {encoding}")

The detected encoding is: utf-8


1d- Load the dataset with the 'utf-8-sig' encoding, which reads UTF-8 files with or without a BOM

In [120]:
new_data = pd.read_csv(new_file_path, encoding='utf-8-sig', header=None)

1e- Display the column names to verify they are correctly parsed

In [121]:
print(new_data.columns)

print("\n")

if list(new_data.columns) == [0]:
    print("This result means that the dataset currently has a single column named '0'. This might indicate that the columns were not correctly parsed or assigned during the data loading or processing steps.")
else:
    print(f"The dataset has the following columns: {list(new_data.columns)}")

Index([0, 1], dtype='int64')


The dataset has the following columns: [0, 1]


1f- Display the first few rows of the dataset

In [122]:
print(new_data.head())

     0                                                  1
0  NaN                                               news
1  0.0  Southside Chicago Blacks Fight Against Liberal...
2  1.0  WIFE OF LIONS QUARTERBACK Matthew Stafford Jus...
3  2.0  HEY CNN‚Ä¶REMEMBER OBAMA‚ÄôS Notorious ‚ÄúFrid...
4  3.0  BREAKING NEWS: SEBASTIAN GORKA OUT‚Ä¶Are Ivank...


**Conclusion:**

The new dataset has three issues: mis-encoded characters in the text, numeric column names instead of descriptive ones, and a header row ('news') that was read as a data row. We will address these issues step by step to make the dataset compatible with the training data.

1g- Re-Load the dataset with the correct encoding

Ensure the dataset is loaded with the correct encoding to handle special characters.

In [123]:
new_data = pd.read_csv(new_file_path, encoding='utf-8-sig', header=None)

1h- Rename Columns

Rename the columns to match the structure of the training data.

In [124]:
new_data.columns = ['index', 'text']

1i- Display the first few rows of the dataset

In [125]:
print(new_data.head())

   index                                               text
0    NaN                                               news
1    0.0  Southside Chicago Blacks Fight Against Liberal...
2    1.0  WIFE OF LIONS QUARTERBACK Matthew Stafford Jus...
3    2.0  HEY CNN‚Ä¶REMEMBER OBAMA‚ÄôS Notorious ‚ÄúFrid...
4    3.0  BREAKING NEWS: SEBASTIAN GORKA OUT‚Ä¶Are Ivank...


1j- Remove the First Row if it Contains Column Names

Check if the first row contains column names and remove it if necessary.

In [126]:
if 'news' in new_data.iloc[0].values:
    new_data = new_data.drop(0)
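
The `NaN`/`news` row suggests that the file has a header line with an unnamed first column. Under that assumption, the header and the index column could be handled directly when loading. The in-memory sample below is made up to mimic that layout and is not the project file.

```python
import io

import pandas as pd

# Hypothetical file content: unnamed index column plus a "news" column.
sample = ",news\n0,First headline\n1,Second headline\n"

df = pd.read_csv(
    io.StringIO(sample),
    header=0,     # first line holds the column names
    index_col=0,  # first column is the row index
)
df = df.rename(columns={"news": "text"})

print(list(df.columns))     # ['text']
print(df["text"].tolist())  # ['First headline', 'Second headline']
```

With this approach there is no header row to drop and no index to reset afterwards.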

1k- Check the first few rows of the dataset

In [127]:
print(new_data.head())

   index                                               text
1    0.0  Southside Chicago Blacks Fight Against Liberal...
2    1.0  WIFE OF LIONS QUARTERBACK Matthew Stafford Jus...
3    2.0  HEY CNN‚Ä¶REMEMBER OBAMA‚ÄôS Notorious ‚ÄúFrid...
4    3.0  BREAKING NEWS: SEBASTIAN GORKA OUT‚Ä¶Are Ivank...
5    4.0  First Grader ‚ÄúInvestigated‚Äù in Principal‚Ä...


1l- Reset Index

Reset the index to ensure it starts from 0.

In [128]:
new_data = new_data.reset_index(drop=True)

1m- Check the first few rows of the dataset

In [129]:
print(new_data.head())

   index                                               text
0    0.0  Southside Chicago Blacks Fight Against Liberal...
1    1.0  WIFE OF LIONS QUARTERBACK Matthew Stafford Jus...
2    2.0  HEY CNN‚Ä¶REMEMBER OBAMA‚ÄôS Notorious ‚ÄúFrid...
3    3.0  BREAKING NEWS: SEBASTIAN GORKA OUT‚Ä¶Are Ivank...
4    4.0  First Grader ‚ÄúInvestigated‚Äù in Principal‚Ä...


1n- Handle Special Characters

Replace the mis-encoded character sequences (mojibake) with plain ASCII equivalents.

In [130]:
new_data['text'] = new_data['text'].str.replace('‚Ä¶', '...')
new_data['text'] = new_data['text'].str.replace('‚Äô', "'")
new_data['text'] = new_data['text'].str.replace('‚Äú', '"')
new_data['text'] = new_data['text'].str.replace('‚Äù', '"')
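
The garbled sequences have the shape one would get if UTF-8 bytes of typographic quotes had been decoded as Mac Roman at some earlier point. That origin is an assumption based on the byte patterns, not something the dataset documents. Under that assumption, a more general repair is to reverse the mis-decoding instead of listing each sequence. The helper name `fix_mojibake` is hypothetical.

```python
def fix_mojibake(text: str) -> str:
    """Reverse UTF-8 text that was mis-decoded as Mac Roman.

    Returns the input unchanged if the round trip is not possible.
    """
    try:
        return text.encode("mac_roman").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Build a garbled sample programmatically from text with typographic quotes.
original = "OBAMA\u2019S \u201cFriday\u201d"
garbled = original.encode("utf-8").decode("mac_roman")

print(fix_mojibake(garbled) == original)  # True
print(fix_mojibake("plain ascii"))        # plain ascii
```

Unlike the replacements above, this restores the typographic quotes rather than ASCII ones, and it cannot recover sequences that were truncated or partly lost. It could be applied per row with `new_data['text'].map(fix_mojibake)`.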

1o- Display the Cleaned Data

Check the first few rows of the cleaned dataset.

In [132]:
print(new_data.head())

   index                                               text
0    0.0  Southside Chicago Blacks Fight Against Liberal...
1    1.0  WIFE OF LIONS QUARTERBACK Matthew Stafford Jus...
2    2.0  HEY CNN...REMEMBER OBAMA'S Notorious "Friday N...
3    3.0  BREAKING NEWS: SEBASTIAN GORKA OUT...Are Ivank...
4    4.0  First Grader "Investigated" in Principal's Off...


1p- Convert Text to Lowercase

Convert all text to lowercase to maintain consistency.

In [133]:
new_data['text'] = new_data['text'].str.lower()
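
For reference, `CountVectorizer` and `TfidfVectorizer` lowercase their input by default (`lowercase=True`), so this step mainly keeps the displayed data consistent with the training set. A short sketch on a made-up string:

```python
from sklearn.feature_extraction.text import CountVectorizer

# The analyzer applies the default preprocessing (lowercasing) and tokenization.
analyzer = CountVectorizer().build_analyzer()
tokens = analyzer("HEY CNN...REMEMBER Obama's Friday")

print(tokens)  # ['hey', 'cnn', 'remember', 'obama', 'friday']
```

The default tokenizer also drops punctuation and single-character tokens, which is why the `s` after the apostrophe does not appear.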

1q- Display the Cleaned Data

Check the first few rows of the cleaned dataset.

In [134]:
print(new_data.head())

   index                                               text
0    0.0  southside chicago blacks fight against liberal...
1    1.0  wife of lions quarterback matthew stafford jus...
2    2.0  hey cnn...remember obama's notorious "friday n...
3    3.0  breaking news: sebastian gorka out...are ivank...
4    4.0  first grader "investigated" in principal's off...


1r- Add a Placeholder Label Column

Since the new dataset does not have labels, add a placeholder column for them.

In [135]:
new_data['label'] = None

1s- Reorder Columns

Keep only the label and text columns, in that order. The index column is dropped on purpose because it is not needed for the analysis. The label column is a placeholder for future predictions, and the text column contains the news data.

In [136]:
# .copy() avoids a possible SettingWithCopyWarning when columns are added later
new_data = new_data[['label', 'text']].copy()

1t- Display the Cleaned Data

Check the first few rows of the cleaned dataset.

In [137]:
print(new_data.head())

  label                                               text
0  None  southside chicago blacks fight against liberal...
1  None  wife of lions quarterback matthew stafford jus...
2  None  hey cnn...remember obama's notorious "friday n...
3  None  breaking news: sebastian gorka out...are ivank...
4  None  first grader "investigated" in principal's off...


# 2- Preprocess New Data

2a- Check for missing values in the new dataset

In [138]:
missing_values = new_data.isnull().sum()
print(missing_values)

print("\n")

print(f"This result means that in the column named 'label', the amount of missing values is {missing_values['label']}, and in the column named 'text', the amount of missing values is {missing_values['text']}.")

label    9983
text        0
dtype: int64


This result means that in the column named 'label', the amount of missing values is 9983, and in the column named 'text', the amount of missing values is 0.


2b- Conditionally drop rows with missing values in the 'label' column

Since the label column is intentionally filled with None values as placeholders, we do not need to drop any rows based on missing values in the label column. Instead, we can proceed to the next step.

2c- Check Data After Preprocessing

In [139]:
print(new_data.isnull().sum())
print(new_data.head())
print(new_data.shape)

print("\n")

print(f"This result means that the dataset has {new_data.shape[0]} rows and {new_data.shape[1]} columns. The two columns are 'label' and 'text'.")

label    9983
text        0
dtype: int64
  label                                               text
0  None  southside chicago blacks fight against liberal...
1  None  wife of lions quarterback matthew stafford jus...
2  None  hey cnn...remember obama's notorious "friday n...
3  None  breaking news: sebastian gorka out...are ivank...
4  None  first grader "investigated" in principal's off...
(9983, 2)


This result means that the dataset has 9983 rows and 2 columns. The two columns are 'label' and 'text'.


# 3- Feature Extraction

3a- Transform the New Text Data into Numerical Features Using the Count Vectorizer

In [140]:
new_data_transformed_count = count_vectorizer.transform(new_data['text'])

3b- Transform the New Text Data into Numerical Features Using the TF-IDF Vectorizer

In [141]:
new_data_transformed_tfidf = tfidf_vectorizer.transform(new_data['text'])

# 4- Predict Labels

4a- Predict Labels Using the Count Vectorizer Features

In [142]:
new_predictions_count = model_count.predict(new_data_transformed_count)

4b- Predict Labels Using the TF-IDF Vectorizer Features

In [143]:
new_predictions_tfidf = model_tfidf.predict(new_data_transformed_tfidf)
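
Besides the hard label, `MultinomialNB` can also report class probabilities through `predict_proba`, which gives a rough idea of how confident each prediction is. The sketch below trains a toy model on made-up texts so that the numbers can be checked by hand; it does not use the project's models.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data with string labels.
train_texts = ["storm hits coast", "senate passes bill"]
train_labels = ["0", "1"]

vectorizer = CountVectorizer()
model = MultinomialNB()
model.fit(vectorizer.fit_transform(train_texts), train_labels)

# One row per input text, one column per class in the order of model.classes_.
proba = model.predict_proba(vectorizer.transform(["storm coast"]))

print(list(model.classes_))      # ['0', '1']
print(round(proba[0][0], 3))     # 0.8
print(round(proba[0][1], 3))     # 0.2
```

In the notebook, the equivalent call would be `model_count.predict_proba(new_data_transformed_count)`.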

4c- Add the Predictions to the New Data

In [144]:
new_data['predicted_label_count'] = new_predictions_count
new_data['predicted_label_tfidf'] = new_predictions_tfidf

4d- Display the First Few Rows of the New Data with Predictions

In [145]:
print(new_data.head())

  label                                               text  \
0  None  southside chicago blacks fight against liberal...   
1  None  wife of lions quarterback matthew stafford jus...   
2  None  hey cnn...remember obama's notorious "friday n...   
3  None  breaking news: sebastian gorka out...are ivank...   
4  None  first grader "investigated" in principal's off...   

  predicted_label_count predicted_label_tfidf  
0                     0                     0  
1                     0                     0  
2                     0                     0  
3                     0                     0  
4                     0                     0  
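
Because the new data has no true labels, accuracy cannot be computed here. Two checks that are still possible are the distribution of predicted labels and how often the two models agree. The sketch below uses made-up predictions in the same column layout as above.

```python
import pandas as pd

# Hypothetical predictions from the two models.
preds = pd.DataFrame(
    {
        "predicted_label_count": ["0", "0", "1", "1"],
        "predicted_label_tfidf": ["0", "1", "1", "1"],
    }
)

# Share of rows where both models predict the same label.
agreement = (preds["predicted_label_count"] == preds["predicted_label_tfidf"]).mean()
# Number of rows per predicted label for the Count Vectorizer model.
counts = preds["predicted_label_count"].value_counts().to_dict()

print(agreement)  # 0.75
print(counts)     # {'0': 2, '1': 2}
```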
