<a href="https://colab.research.google.com/github/solvedbrunus/Project-2-NLP/blob/main/nlp_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1- Import Libraries

Data Manipulation:

pandas: Provides data structures like DataFrames, which are useful for handling and processing structured data.

In [89]:
import pandas as pd

Feature Extraction:

CountVectorizer: Converts a collection of text documents to a matrix of token counts.

TfidfVectorizer: Converts a collection of raw documents to a matrix of TF-IDF features, which reflect the importance of words in the documents.
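To make the difference between the two vectorizers concrete, here is a minimal sketch on a three-sentence toy corpus. The sentences and variable names are made up for illustration and are not taken from the project dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not part of the project data
docs = ["trump sends tweet", "pope sends message", "trump trump tweet"]

# Raw token counts: one row per document, one column per vocabulary word
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# TF-IDF weights: counts rescaled so that words shared by many documents weigh less
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)

# Vocabulary words ordered by their column index
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)

print(vocab)
print(counts.toarray())
print(tfidf.toarray().round(2))
```

Both matrices have the same shape; only the values differ. With default settings, each TF-IDF row is scaled to unit length.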

In [90]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

Model Training:

MultinomialNB: Implements the Multinomial Naive Bayes algorithm, which is suitable for classification with discrete features (like word counts for text classification).

In [91]:
from sklearn.naive_bayes import MultinomialNB

Model Evaluation:

accuracy_score: Calculates the accuracy of the model by comparing the predicted labels with the true labels.
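As a quick illustration of what accuracy measures, the sketch below compares two short, made-up label lists. Accuracy is simply the fraction of positions where the prediction matches the true label.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels for five samples
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]  # the third prediction is wrong

# Fraction of predictions that match the true labels (4 of 5 here)
acc = accuracy_score(y_true, y_pred)
print(acc)
```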

In [92]:
from sklearn.metrics import accuracy_score

# 2- Load Dataset

2a - Define the file path once

In [93]:
file_path = "D:/OneDrive - Royal HaskoningDHV/920791/Pri 3/ironhack/nlp-project/Project-2-NLP/dataset/training_data_lowercase.csv"

2b- Load the dataset

In [94]:
data = pd.read_csv(file_path)

2c- Install the chardet library to detect the encoding programmatically

In [95]:
%pip install chardet




2d- Detect the encoding

In [96]:
import chardet

# Read the first few bytes of the file to detect the encoding
with open(file_path, 'rb') as file:
    raw_data = file.read(10000)
    result = chardet.detect(raw_data)
    encoding = result['encoding']
    print(f"The detected encoding is: {encoding}")

The detected encoding is: UTF-8-SIG
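UTF-8-SIG means the file is UTF-8 with a byte order mark (BOM) at the start. The sketch below uses only the standard library and a made-up byte string to show why the codec matters: decoding as plain `utf-8` leaves the BOM in the text as an invisible `\ufeff` character, while `utf-8-sig` strips it.

```python
import codecs

# Hypothetical file content: a UTF-8 BOM followed by one tab-separated row
raw = codecs.BOM_UTF8 + "0\tsample headline".encode("utf-8")

plain = raw.decode("utf-8")      # BOM is kept as the first character
clean = raw.decode("utf-8-sig")  # BOM is removed during decoding

print(repr(plain[:3]))
print(repr(clean[:3]))
```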


2e- Load the dataset with the correct encoding to handle BOM

In [97]:
data = pd.read_csv(file_path, encoding='utf-8-sig', header=None)

2f- Display the column names to verify they are correctly parsed

In [98]:
print(data.columns)

print("\n")

if list(data.columns) == [0]:
    print("This result means that the dataset currently has a single column named '0'. This might indicate that the columns were not correctly parsed or assigned during the data loading or processing steps.")
else:
    print(f"The dataset has the following columns: {list(data.columns)}")

Index([0], dtype='int64')


This result means that the dataset currently has a single column named '0'. This might indicate that the columns were not correctly parsed or assigned during the data loading or processing steps.


2g- Display the first few rows of the dataset

In [99]:
print(data.head())

                                                   0
0  0\tdonald trump sends out embarrassing new yea...
1  0\tdrunk bragging trump staffer started russia...
2  0\tsheriff david clarke becomes an internet jo...
3  0\ttrump is so obsessed he even has obama‚s na...
4  0\tpope francis just called out donald trump d...


**Conclusion:**

Steps 2f and 2g show that the dataset was not split into label and text columns: each row is a single string that begins with the label, followed by a tab character ('\t') and then the sentence (for example, '0\t...'). We therefore reload the dataset and split each row at the first tab, so that the part before the tab becomes the new label column and the remainder becomes the text column.
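An alternative worth noting is to let pandas do the split while reading, by passing the tab as the separator. The sketch below is an assumption-laden illustration on an in-memory sample, not the notebook's method: the two sample rows and the column names passed to `names` are hypothetical, and whether this parses the real file cleanly depends on its contents (for example, stray quote characters).

```python
import io
import pandas as pd

# Hypothetical two-row sample mimicking the layout: label, tab, text
sample = "0\tfirst headline\n1\tsecond headline\n"

# sep="\t" tells read_csv to split columns on tabs instead of commas
df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["label", "text"],
)

print(df)
print(df.dtypes)
```

Read this way, the label column is parsed as integers rather than strings.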

2h- Load the dataset with the correct encoding to handle BOM

In [100]:
data = pd.read_csv(file_path, encoding='utf-8-sig', header=None)

2i- Display the first few rows of the dataset

In [101]:
print(data.head())

                                                   0
0  0\tdonald trump sends out embarrassing new yea...
1  0\tdrunk bragging trump staffer started russia...
2  0\tsheriff david clarke becomes an internet jo...
3  0\ttrump is so obsessed he even has obama‚s na...
4  0\tpope francis just called out donald trump d...


2j- Split each row at the first tab ('\t'): the part before it (e.g. '0') becomes the label column and the rest becomes the text column

In [102]:
data[['label', 'text']] = data[0].str.split('\t', n=1, expand=True)
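The following sketch shows how this kind of split behaves on two made-up rows. With `n=1` only the first tab is used, so any later tabs stay inside the text. The values produced by `str.split` are strings; the `astype(int)` conversion is an optional extra step, not something the notebook does.

```python
import pandas as pd

# Hypothetical rows; the second one contains an extra tab inside the text
raw = pd.DataFrame({0: ["0\tfirst headline", "1\tsecond\theadline"]})

# Split on the first tab only and spread the two parts over two columns
parts = raw[0].str.split("\t", n=1, expand=True)
parts.columns = ["label", "text"]

# Optional: turn the string labels '0' and '1' into integers
parts["label"] = parts["label"].astype(int)

print(parts)
```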

2k- Drop the original column

In [103]:
data = data.drop(columns=[0])

2l- Display the first few rows of the dataset after splitting it into label and text columns

In [104]:
print(data.head())

print("\n")

if list(data.columns) == [0]:
    print("This result means that the dataset currently has a single column named '0'. This might indicate that the columns were not correctly parsed or assigned during the data loading or processing steps.")
else:
    print(f"The dataset has the following columns: {list(data.columns)}")

  label                                               text
0     0  donald trump sends out embarrassing new year‚s...
1     0  drunk bragging trump staffer started russian c...
2     0  sheriff david clarke becomes an internet joke ...
3     0  trump is so obsessed he even has obama‚s name ...
4     0  pope francis just called out donald trump duri...


The dataset has the following columns: ['label', 'text']


# 3- Preprocess Data

3a- Check for missing values in the dataset

In [114]:
missing_values = data.isnull().sum()
print(missing_values)

print("\n")

print(f"This result means that in the column named 'label', the amount of missing values is {missing_values['label']}, and in the column named 'text', the amount of missing values is {missing_values['text']}.")

label    0
text     0
dtype: int64


This result means that in the column named 'label', the amount of missing values is 0, and in the column named 'text', the amount of missing values is 0.
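Note that `isnull()` only counts true missing values (NaN or None). Empty or whitespace-only strings are not counted. The sketch below, on a made-up three-row frame, shows one way to count both; the frame and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical frame: one normal row, one whitespace-only text, one missing text
toy = pd.DataFrame({
    "label": ["0", "1", "0"],
    "text": ["some headline", "   ", None],
})

# True missing values
n_missing = int(toy["text"].isnull().sum())

# Present but blank: not null, yet empty once surrounding whitespace is stripped
is_blank = toy["text"].notnull() & toy["text"].fillna("").str.strip().eq("")
n_blank = int(is_blank.sum())

print(n_missing, n_blank)
```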


3b- Conditionally drop rows with missing values in the 'label' column

In [123]:
if missing_values['label'] > 0:
    data = data.dropna(subset=['label'])
    print("Rows with missing values in the 'label' column have been dropped.")
else:
    print("No missing values found in the 'label' column. No rows were dropped.")

No missing values found in the 'label' column. No rows were dropped.


3c- Check Data After Preprocessing

In [115]:
print(data.isnull().sum())
print(data.head())
print(data.shape)

print("\n")

print(f"This result means that the dataset has {data.shape[0]} rows and {data.shape[1]} columns. The two columns are 'label' and 'text'.")

label    0
text     0
dtype: int64
  label                                               text
0     0  donald trump sends out embarrassing new year‚s...
1     0  drunk bragging trump staffer started russian c...
2     0  sheriff david clarke becomes an internet joke ...
3     0  trump is so obsessed he even has obama‚s name ...
4     0  pope francis just called out donald trump duri...
(34152, 2)


This result means that the dataset has 34152 rows and 2 columns. The two columns are 'label' and 'text'.


# 4- Split Data

We split the dataset into training and testing sets. The training set is used to train the model, and the testing set is used to evaluate the model’s performance.

4a- Define the test size and random state as variables

In [119]:
test_size = 0.2
random_state = 42

4b- Split the data into training and testing sets

In [121]:
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['label'], test_size=test_size, random_state=random_state)

print("The dataset has been split into training and testing sets.")
print("X_train and y_train are the training data and labels, respectively.")
print("X_test and y_test are the testing data and labels, respectively.")
print(f"{test_size * 100}% of the data is used for testing, and {(1 - test_size) * 100}% is used for training.")
print(f"The random_state={random_state} ensures that the split is reproducible, meaning that every time you run the code with the same random_state value, you will get the same split of training and testing data. This is important for consistency and reliability in your results.")

The dataset has been split into training and testing sets.
X_train and y_train are the training data and labels, respectively.
X_test and y_test are the testing data and labels, respectively.
20.0% of the data is used for testing, and 80.0% is used for training.
The random_state=42 ensures that the split is reproducible, meaning that every time you run the code with the same random_state value, you will get the same split of training and testing data. This is important for consistency and reliability in your results.
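The reproducibility claim can be checked directly. The sketch below splits a small made-up dataset twice with the same `random_state` and compares the results. It also passes `stratify`, which the notebook does not use; it is shown here only as an option that keeps the class proportions the same in both subsets.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten texts with alternating labels
texts = [f"headline {i}" for i in range(10)]
labels = [0, 1] * 5

# Two independent calls with identical settings
split_a = train_test_split(texts, labels, test_size=0.2, random_state=42, stratify=labels)
split_b = train_test_split(texts, labels, test_size=0.2, random_state=42, stratify=labels)

# train_test_split returns [X_train, X_test, y_train, y_test]
X_test_a, X_test_b = split_a[1], split_b[1]
y_test_a = split_a[3]

print(X_test_a == X_test_b)  # identical test sets
print(sorted(y_test_a))      # class balance inside the test set
```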


# 5- Feature Extraction

We use two methods to convert the text data into numerical features: Count Vectorizer and TF-IDF Vectorizer.

5a- Method 1: Count Vectorizer

Converts text into a matrix of token counts.

In [109]:
count_vectorizer = CountVectorizer()
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

5b- Method 2: TF-IDF Vectorizer

Converts text into a matrix of TF-IDF features, which reflect the importance of words in the documents.

In [110]:
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
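In both methods the vectorizer is fitted on the training texts only and then merely applied to the test texts. A consequence, sketched below on made-up sentences, is that words seen only in the test set have no column and are silently ignored, while the test matrix keeps the same number of columns as the training matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences; 'telegram' never appears in the training texts
train_docs = ["trump sends tweet", "pope sends message"]
test_docs = ["pope sends telegram"]

vec = CountVectorizer()
X_tr = vec.fit_transform(train_docs)  # learns the vocabulary
X_te = vec.transform(test_docs)       # reuses it, learns nothing new

print(X_tr.shape, X_te.shape)
print("telegram" in vec.vocabulary_)
print(X_te.toarray())
```

Fitting on the training set alone keeps information from the test set out of the features, which is why `transform` rather than `fit_transform` is used for `X_test`.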

# 6- Train and Evaluate Model

6a- Using Count Vectorizer features

In [111]:
model_count = MultinomialNB()
model_count.fit(X_train_count, y_train)
y_pred_count = model_count.predict(X_test_count)
accuracy_count = accuracy_score(y_test, y_pred_count)


6b- Using TF-IDF Vectorizer features

In [112]:
model_tfidf = MultinomialNB()
model_tfidf.fit(X_train_tfidf, y_train)
y_pred_tfidf = model_tfidf.predict(X_test_tfidf)
accuracy_tfidf = accuracy_score(y_test, y_pred_tfidf)
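The vectorize-then-classify pattern used in 6a and 6b can also be written as a single scikit-learn pipeline. The sketch below is an optional restructuring, not the notebook's code; the four training sentences and their labels are invented, and the label values carry no meaning beyond this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data with two arbitrary classes
train_docs = [
    "shocking secret exposed",
    "you will not believe this",
    "senate passes budget bill",
    "court rules on appeal",
]
train_labels = [0, 0, 1, 1]

# The pipeline fits the vectorizer and the classifier in one call
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)

# Raw text goes in; the pipeline vectorizes it with the fitted vocabulary
pred = clf.predict(["senate court budget"])
print(pred)
```

Swapping `CountVectorizer()` for `TfidfVectorizer()` in the same pipeline would give the TF-IDF variant.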

6c- Compare the accuracies

In [117]:
if accuracy_tfidf > accuracy_count:
    best_representation = "TF-IDF Vectorizer"
    best_accuracy = accuracy_tfidf
else:
    best_representation = "Count Vectorizer"
    best_accuracy = accuracy_count

print(f"The best feature representation is {best_representation} with an accuracy of {best_accuracy:.4f}.")

print("\n")

if best_representation == "Count Vectorizer":
    print("This means:\n\nCount Vectorizer: Using the word counts as features worked better than using TF-IDF scores.\nAccuracy: The model correctly identified whether the news is fake or real {:.2f}% of the time.".format(best_accuracy * 100))
else:
    print("This means:\n\nTF-IDF Vectorizer: Using the TF-IDF scores as features worked better than using word counts.\nAccuracy: The model correctly identified whether the news is fake or real {:.2f}% of the time.".format(best_accuracy * 100))

The best feature representation is Count Vectorizer with an accuracy of 0.9448.


This means:

Count Vectorizer: Using the word counts as features worked better than using TF-IDF scores.
Accuracy: The model correctly identified whether the news is fake or real 94.48% of the time.
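Accuracy is a single number and does not show which class the errors fall on. A confusion matrix and a per-class report can add that detail. The sketch below uses made-up string labels (matching the string type that the split in step 2j produces); in the notebook, `y_test` and `y_pred_count` would take the place of the toy lists.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels for six samples
y_true = ["0", "0", "1", "1", "1", "0"]
y_pred = ["0", "1", "1", "1", "0", "0"]

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_pred, labels=["0", "1"])
print(cm)

# Precision, recall and F1 score for each class
print(classification_report(y_true, y_pred, labels=["0", "1"]))
```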
