### Step 1: Loading the Dataset

We begin by importing pandas and loading the `tweets.csv` dataset into a DataFrame. The dataset contains 1.6 million tweets, each labeled with a sentiment polarity (0 = negative, 4 = positive), along with metadata: tweet ID, timestamp, query, username, and the raw tweet text. The file has no header row, so we load it with `header=None` and assign the column names ourselves.

In [6]:
import pandas as pd

# Load dataset
df = pd.read_csv("tweets.csv", encoding='latin-1', header=None)

# Rename columns for clarity
df.columns = ['target', 'id', 'date', 'query', 'user', 'tweet']

# View the structure
print("Shape of dataset:", df.shape)
print("Column names:", df.columns.tolist())
df.head()

Shape of dataset: (1600000, 6)
Column names: ['target', 'id', 'date', 'query', 'user', 'tweet']


,target,id,date,query,user,tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
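
The loading options used in this step can be illustrated without the full file. The following is a minimal, dependency-free sketch: the two rows are made up (only their six-column layout follows the dataset), and it assumes the file needs `latin-1` because some bytes are not valid UTF-8. It shows why an explicit encoding matters and what assigning column names to a header-less file amounts to.

```python
import csv
import io
from collections import Counter

COLUMNS = ['target', 'id', 'date', 'query', 'user', 'tweet']

# Two made-up rows in a header-less six-column layout; \xe9 is "é" in Latin-1
raw = (
    b'"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","missed the caf\xe9 again"\n'
    b'"4","2","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","user_b","great day, really"\n'
)

# The same bytes are not valid UTF-8, so decoding them as UTF-8 fails
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decode as Latin-1, parse the CSV, and attach the column names by position
reader = csv.reader(io.StringIO(raw.decode('latin-1')))
rows = [dict(zip(COLUMNS, fields)) for fields in reader]

# Count how many rows carry each polarity label (0 = negative, 4 = positive)
label_counts = Counter(int(row['target']) for row in rows)
print(utf8_ok, rows[0]['tweet'], dict(label_counts))
```

On the real DataFrame, `df['target'].value_counts()` gives the same kind of label count and is a quick way to check class balance before modeling.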


### Step 2: Data Cleaning

In this step, we’ll clean the text data to prepare it for sentiment analysis. The raw tweets contain noise such as usernames, URLs, special characters, numbers, and excess whitespace. These elements add little to the word-based features we build later, so we remove them.

We will perform the following cleaning tasks:
- Remove Twitter handles (@username)
- Remove URLs
- Remove numbers
- Remove special characters and punctuation
- Convert text to lowercase
- Remove extra spaces

We’ll also create a new column called `clean_tweet` to store the cleaned version of each tweet.

In [7]:
import re

def clean_tweet(text):
    text = re.sub(r'@[\w]+', '', text)                   # remove @mentions
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)  # remove URLs
    text = re.sub(r'\d+', '', text)                      # remove numbers
    text = re.sub(r'[^\w\s]', '', text)                  # remove punctuation
    text = text.lower()                                  # convert to lowercase
    text = re.sub(r'\s+', ' ', text).strip()             # remove extra spaces
    return text

# Apply cleaning function
df['clean_tweet'] = df['tweet'].apply(clean_tweet)

# View cleaned tweets
df[['tweet', 'clean_tweet']].head()

,tweet,clean_tweet
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",a thats a bummer you shoulda got david carr of...
1,is upset that he can't update his Facebook by ...,is upset that he cant update his facebook by t...
2,@Kenichan I dived many times for the ball. Man...,i dived many times for the ball managed to sav...
3,my whole body feels itchy and like its on fire,my whole body feels itchy and like its on fire
4,"@nationwideclass no, it's not behaving at all....",no its not behaving at all im mad why am i her...
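
Row 0 of the output is worth a closer look: "Awww," has been reduced to "a". This happens because the `www\S+` alternative in the URL pattern also matches the tail of "Awww,". The sketch below reproduces the effect on the start of that tweet and compares it with a stricter pattern. The stricter pattern is a hypothetical alternative, not part of the original notebook. It requires either an explicit scheme or `www.` at a word boundary.

```python
import re

# URL pattern used in clean_tweet above
ORIGINAL_URL = re.compile(r'http\S+|www\S+|https\S+')

# Hypothetical stricter pattern: a scheme, or "www." at a word boundary
STRICTER_URL = re.compile(r'https?://\S+|\bwww\.\S+')

# Start of the first tweet, after the @mention has been removed
sample = "http://twitpic.com/2y1zl - Awww, that's a bummer"

loose = ORIGINAL_URL.sub('', sample)    # also eats "www," inside "Awww,"
strict = STRICTER_URL.sub('', sample)   # removes only the actual URL

print(repr(loose))
print(repr(strict))
```

If the stricter pattern is adopted in `clean_tweet`, the cleaned text for this row would keep "awww" as a token, so the displayed output would change accordingly.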


### Step 3: Text Preprocessing

Now that the tweets are cleaned, we’ll normalize the text further so that variants of the same word are counted together. This involves:

- **Tokenization**: Splitting each tweet into individual words.
- **Stopword Removal**: Dropping common words like "the" and "is", which carry little sentiment.
- **Lemmatization**: Reducing words to their dictionary form (e.g., "cars" → "car"). NLTK's WordNet lemmatizer treats every word as a noun unless it is given a part-of-speech tag, so "running" becomes "run" only when tagged as a verb.

We'll use the **NLTK** library for tokenization, stopwords, and lemmatization.

A new column called `preprocessed_tweet` will store the final version of processed tweets.
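
The three stages can be sketched in a few lines before bringing in NLTK. This is a dependency-free illustration only: `STOPWORDS` and `LEMMAS` are tiny hypothetical stand-ins for NLTK's stopword list and WordNet lemmatizer, and whitespace splitting stands in for a real tokenizer.

```python
# Tiny hypothetical stand-ins for NLTK's stopword list and WordNet lemmatizer
STOPWORDS = {"the", "is", "a", "and", "on"}
LEMMAS = {"cats": "cat", "mice": "mouse", "feet": "foot"}

def preprocess(text):
    tokens = text.split()                                # tokenization (whitespace only)
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    tokens = [LEMMAS.get(t, t) for t in tokens]          # lemmatization via lookup table
    return " ".join(tokens)

result = preprocess("the cats and the mice is on a mat")
print(result)
```

The order matters: stopwords are filtered on the surface form, and only the surviving tokens are reduced to their lemmas.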

In [8]:
import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to C:\Users\Ridhwan
[nltk_data]     Salim\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Ridhwan
[nltk_data]     Salim\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Ridhwan
[nltk_data]     Salim\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:\Users\Ridhwan
[nltk_data]     Salim\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [9]:
import nltk

# Manually set the NLTK data path
nltk_data_dir = r"e:\Study\internships\Coding Samurai\CODING-SAMURAI-INTERNSHIP-TASK\Level3\Project6_Sentiment_Tweets\nltk_data"
if nltk_data_dir not in nltk.data.path:  # avoid duplicate entries when the cell is re-run
    nltk.data.path.append(nltk_data_dir)

# Download and store resources to that path (if not already present)
nltk.download('punkt', download_dir=nltk_data_dir)
nltk.download('stopwords', download_dir=nltk_data_dir)
nltk.download('wordnet', download_dir=nltk_data_dir)
nltk.download('omw-1.4', download_dir=nltk_data_dir)

[nltk_data] Downloading package punkt to e:\Study\internships\Coding
[nltk_data]     Samurai\CODING-SAMURAI-INTERNSHIP-
[nltk_data]     TASK\Level3\Project6_Sentiment_Tweets\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     e:\Study\internships\Coding Samurai\CODING-SAMURAI-
[nltk_data]     INTERNSHIP-
[nltk_data]     TASK\Level3\Project6_Sentiment_Tweets\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to e:\Study\internships\Coding
[nltk_data]     Samurai\CODING-SAMURAI-INTERNSHIP-
[nltk_data]     TASK\Level3\Project6_Sentiment_Tweets\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to e:\Study\internships\Coding
[nltk_data]     Samurai\CODING-SAMURAI-INTERNSHIP-
[nltk_data]     TASK\Level3\Project6_Sentiment_Tweets\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [10]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# word_tokenize looks up 'tokenizers/punkt_tab' in recent NLTK releases,
# so fetch that resource in addition to the packages downloaded above
nltk.download('punkt_tab')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    tokens = word_tokenize(text)                          # tokenization
    tokens = [t for t in tokens if t not in stop_words]   # stopword removal
    tokens = [lemmatizer.lemmatize(t) for t in tokens]    # lemmatization (default POS: noun)
    return ' '.join(tokens)

df['preprocessed_tweet'] = df['clean_tweet'].apply(preprocess_text)

### Step 4: Feature Extraction Using TF-IDF

To transform the preprocessed text into a numerical format suitable for machine learning, we apply TF-IDF vectorization. This method weights each word by its frequency within an individual tweet (TF) and by its rarity across the entire dataset (IDF). Words that are frequent in one tweet but rare across the others get higher scores.

We cap the vocabulary at the 5,000 most frequent terms (`max_features=5000`) to keep the feature matrix manageable.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['preprocessed_tweet'])

y = df['target']
print("TF-IDF matrix shape:", X.shape)
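
To make the TF-IDF weighting concrete, here is a dependency-free sketch on a made-up three-document corpus. It assumes the smoothed formula that scikit-learn documents for its default settings, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The corpus and all variable names are illustrative only.

```python
import math
from collections import Counter

# Made-up toy corpus, already cleaned and preprocessed
docs = ["good day", "bad day", "good good mood"]
tokenized = [d.split() for d in docs]

vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency (assumed to match scikit-learn's default)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(doc):
    counts = Counter(doc)
    weights = [counts[t] * idf[t] for t in vocab]    # raw term count times IDF
    norm = math.sqrt(sum(w * w for w in weights))    # L2 norm of the row
    return [w / norm for w in weights]

matrix = [tfidf_row(doc) for doc in tokenized]

for doc, row in zip(docs, matrix):
    print(doc, [round(w, 3) for w in row])
```

In "bad day", the term "bad" appears in only one document while "day" appears in two, so "bad" ends up with the larger weight even though both occur once in that tweet.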