<a href="https://colab.research.google.com/github/vinodft263035-vkjngd/NLP-Assignment/blob/main/NLP_Assignment_Vinod_Kumar_FT263035.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Import necessary libraries
import nltk
import spacy
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, accuracy_score
from nltk.corpus import stopwords

In [None]:
# Download NLTK resources
nltk.download('stopwords')

# Load spaCy model for lemmatization
nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [57]:
try:
    df = pd.read_csv(
        'all-data.csv',
        encoding='latin-1',
        header=None,
        names=['Sentiment', 'News Headline']
    )
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'all-data.csv' not found. Please upload the dataset or adjust the file path.")
    # Create a dummy DataFrame for demonstration if the file is missing
    data = {
        'Sentiment': ['neutral', 'positive', 'negative', 'neutral', 'positive'],
        'News Headline': [
            'Budget proposal passed with strong support.',
            'Stocks surged after Fed rate hike decision.',
            'Oil prices dropped on oversupply concerns.',
            'Company XYZ announced Q3 earnings today.',
            'Tech giant acquired a promising startup.'
        ]
    }
    df = pd.DataFrame(data)
    print("Using dummy data for demonstration.")

print(f"\nInitial Dataset Shape: {df.shape}")
print("First 5 rows of the dataset:")
print(df.head())
print("\nSentiment Distribution:")
print(df['Sentiment'].value_counts())

Dataset loaded successfully.

Initial Dataset Shape: (4846, 2)
First 5 rows of the dataset:
  Sentiment                                      News Headline
0   neutral  According to Gran , the company has no plans t...
1   neutral  Technopolis plans to develop in stages an area...
2  negative  The international electronic industry company ...
3  positive  With the new production plant the company woul...
4  positive  According to the company 's updated strategy f...

Sentiment Distribution:
Sentiment
neutral     2879
positive    1363
negative     604
Name: count, dtype: int64


In [58]:
#2. NLP TEXT CLEANSING STEPS


def clean_text(text, custom_stopwords=None):
    """Applies a series of text cleansing steps."""

    # 1. Remove unwanted characters (non-alphanumeric, keeping spaces)
    # We keep letters (a-z) and numbers (0-9)
    text = re.sub(r'[^a-z0-9\s]', '', text)



    # 2. Remove URLs (using regex)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # 3. Remove HTML tags (using BeautifulSoup)
    # The 'html.parser' will handle most common tags
    text = BeautifulSoup(text, 'html.parser').get_text()

    # 4. Lowercase the text
    text = text.lower()



    # 5. Stopwords removal
    # The document advises checking if the list needs modification. For financial news,
    # words like 'will', 'was', 'is' might be fine to remove, but 'down', 'up',
    # 'shares', 'stock', 'market' are crucial and should NOT be removed.

    # Start with the standard English stop list
    stop_words_standard = set(stopwords.words('english'))

    # Define crucial financial terms to KEEP (must be lowercased)
    finance_terms_to_keep = {'down', 'up', 'shares', 'stock', 'market', 'gain', 'lose', 'hike', 'drop', 'cut', 'sell', 'buy', 'rise', 'fall'}

    # Create a modified stop list: standard list minus the crucial financial terms
    modified_stopwords = stop_words_standard.difference(finance_terms_to_keep)

    if custom_stopwords is None:
        final_stopwords = modified_stopwords
    else:
        # If a custom list is provided, use that
        final_stopwords = custom_stopwords

    # Tokenize and remove stopwords
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in final_stopwords]

    # Rejoin the tokens into a string
    text = ' '.join(tokens)

    return text

# Apply the cleaning function to the 'News Headline' column
df['Cleaned Headline'] = df['News Headline'].apply(clean_text)

print("\n--- Text Cleansing Results ---")
print("Original Headline:", df['News Headline'].iloc[1])
print("Cleaned Headline:", df['Cleaned Headline'].iloc[1])



--- Text Cleansing Results ---
Original Headline: Technopolis plans to develop in stages an area of no less than 100,000 square meters in order to host companies working in computer technologies and telecommunications , the statement said .
Cleaned Headline: echnopolis plans develop stages area less 100000 square meters order host companies working computer technologies telecommunications statement said


In [None]:
# --- 6. Spellcheck ---
# NOTE: This is computationally expensive and is typically skipped for headline data
# where abbreviations/jargon are common and can be flagged as errors.
from spellchecker import SpellChecker

# Initialize the spell checker instance globally
spell = SpellChecker()

def correct_spelling(text):
    """
    Applies spell correction to the input text by splitting it into words,
    correcting them using a dictionary, and rejoining them.

    The check 'spell.correction(word) is not None' ensures that only words
    that the SpellChecker can actually suggest a correction for are processed.
    """
    # Fix: Indent the function body
    words = text.split()
    corrected_words = [spell.correction(word) if spell.correction(word) is not None else word for word in words]
    return " ".join(corrected_words)

# Apply the function to the 'Cleaned Headline' column
df['Corrected Headline'] = df['Cleaned Headline'].apply(correct_spelling)
# Note: Uncomment the line above to run the operation.

print("Spellcheck step skipped for performance. Uncomment the code to run.")

# Convert the chat acronyms to correct words

In [49]:
# --- 7. Replace Acronyms with Words ---
# This requires a comprehensive custom dictionary (e.g., Fed -> Federal Reserve).
# Since a universal dictionary for financial news is not available, we demonstrate
# the structure and use a small example for replacement.
acronym_map = {
    'fed': 'federal reserve',
    'ceo': 'chief executive officer',
    'fomc': 'federal open market committee'
}

def replace_acronyms(text):
    for acronym, meaning in acronym_map.items():
        # Use regex to replace whole words only
        text = re.sub(r'\b' + acronym + r'\b', meaning, text)
    return text

df['Cleaned Headline'] = df['Cleaned Headline'].apply(replace_acronyms)
print("\nHeadline after Acronym Replacement (e.g., 'fed' is replaced):")
print(df[df['News Headline'].str.lower().str.contains('fed', na=False, regex=False)].iloc[0]['Cleaned Headline'] if not df.empty else "N/A")

#chat_word_dict = {}

#chat_word_dict = {"AFAIK" : "As Far As I Know",
#"AFK" : "Away From Keyboard",
#"ASAP" : "As Soon As Possible",
#"ATK" : "At The Keyboard",
#"ATM" : "At The Moment",
#"A3" : "Anytime, Anywhere, Anyplace",
#"BAK" : "Back At Keyboard",
#"BBL" : "Be Back Later",
#"BBS" : "Be Back Soon",
#"BFN" : "Bye For Now",
#"B4N" : "Bye For Now",
#"BRB" : "Be Right Back",
#"BRT" : "Be Right There",
#"BTW" : "By The Way",
#"B4" : "Before",
#B4N" : "Bye For Now",
#"CU" : "See You",
#"CUL8R" : "See You Later",
#"CYA" : "See You",
#"FAQ" : "Frequently Asked Questions",
#"FC" : "Fingers Crossed",
#"FWIW" : "For What It's Worth",
#FYI" : "For Your Information",
#"GAL" : "Get A Life",
#"GG" : "Good Game",
#"GN" : "Good Night",
#"GMTA" : "Great Minds Think Alike",
#"GR8" : "Great!",
#"G9" : "Genius",
#"IC" : "I See",
#ICQ" : "I Seek you",
#"ILU" : "I Love You",
#"IMHO" : "In My Honest/Humble Opinion",
#"IMO" : "In My Opinion",
#"IOW" : "In Other Words",
#"IRL" : "In Real Life",
#"KISS" : "Keep It Simple, Stupid",
#"LDR" : "Long Distance Relationship",
#"LMAO" : "Laugh My A.. Off",
#"LOL" : "Laughing Out Loud",
#"LTNS" : "Long Time No See",
#"L8R" : "Later",
#"MTE" : "My Thoughts Exactly",
#"M8" : "Mate",
#"NRN" : "No Reply Necessary",
#"OIC" : "Oh I See",
#"PITA" : "Pain In The A..",
#"PRT" : "Party",
#"PRW" : "Parents Are Watching",
#"ROFL" : "Rolling On The Floor Laughing",
#"ROFLOL" : "Rolling On The Floor Laughing Out Loud",
#"ROTFLMAO" : "Rolling On The Floor Laughing My A.. Off",
#"SK8" : "Skate",
#"STATS" : "Your sex and age",
#"ASL" : "Age, Sex, Location",
#"THX" : "Thank You",
#"TTFN" : "Ta-Ta For Now!",
#"TTYL" : "Talk To You Later",
#"U" : "You",
#"U2" : "You Too",
#"U4E" : "Yours For Ever",
#"WB" : "Welcome Back",
#"WTF" : "What The F...",
#"WTG" : "Way To Go!",
#"WUF" : "Where Are You From?",
#"W8" : "Wait..."
#}


Headline after Acronym Replacement (e.g., 'fed' is replaced):
ioneer ibrary ystem one 127 libraries municipalities arts culture higher education science organizations awarded grants participate ig ead largest federal reading program history


In [50]:
# 3. DATA PREPARATION AND SPLITTING


# Encode the target variable (Sentiment)
# We will use the 'Cleaned Headline' as the feature (X) and 'Sentiment' as the target (y).
X = df['Cleaned Headline']
y = df['Sentiment']

# Split the data into training and testing sets
# We use stratified split to maintain the proportion of sentiment classes in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print(f"\nTraining set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")



# 4. VECTORIZATION: BOW (CountVectorizer)


# Initialize the CountVectorizer (Bag of Words)
# We use max_features to limit the vocabulary size, which improves performance and reduces noise.
bow_vectorizer = CountVectorizer(max_features=5000)

# Fit the vectorizer on the TRAINING data and transform both sets
X_train_bow = bow_vectorizer.fit_transform(X_train).toarray()
X_test_bow = bow_vectorizer.transform(X_test).toarray()

print(f"\nBOW Vectorization Complete:")
print(f"Train BOW Shape: {X_train_bow.shape}")
print(f"Test BOW Shape: {X_test_bow.shape}")




Training set size: 3876 samples
Testing set size: 970 samples

BOW Vectorization Complete:
Train BOW Shape: (3876, 5000)
Test BOW Shape: (970, 5000)


In [51]:
# 5. VECTORIZATION: TF-IDF (TfidfVectorizer)

# Initialize the TFIDF Vectorizer
# We use the same max_features for a fair comparison of vocabulary size.
tfidf_vectorizer = TfidfVectorizer(max_features=5000)

# Fit the vectorizer on the TRAINING data and transform both sets
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train).toarray()
X_test_tfidf = tfidf_vectorizer.transform(X_test).toarray()

print(f"\nTF-IDF Vectorization Complete:")
print(f"Train TF-IDF Shape: {X_train_tfidf.shape}")
print(f"Test TF-IDF Shape: {X_test_tfidf.shape}")



TF-IDF Vectorization Complete:
Train TF-IDF Shape: (3876, 5000)
Test TF-IDF Shape: (970, 5000)


In [52]:
# 6. MODEL TRAINING & EVALUATION (Random Forest)

# Define a function to train and evaluate the model
def train_evaluate_model(X_train_vec, X_test_vec, y_train, y_test, model_name):
    """Trains a Random Forest model and prints the classification report."""

    print(f"\n--- Training {model_name} Model ---")

    # Initialize the Random Forest Classifier
    # We set random_state for reproducibility
    model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)

    # Train the model
    model.fit(X_train_vec, y_train)

    # Make predictions
    y_pred = model.predict(X_test_vec)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, output_dict=True)
    conf_matrix = confusion_matrix(y_test, y_pred)

    print(f"Overall Accuracy ({model_name}): {accuracy:.4f}")
    print(f"\nClassification Report ({model_name}):")
    # Print the report in a readable text format
    print(classification_report(y_test, y_pred))

    print(f"\nConfusion Matrix ({model_name}):")
    print(conf_matrix)

    return accuracy, report, model_name, conf_matrix


# 6.1. Model 1: BOW Vectorization

accuracy_bow, report_bow, _, conf_bow = train_evaluate_model(
    X_train_bow,
    X_test_bow,
    y_train,
    y_test,
    "RandomForest with BOW"
)


# 6.2. Model 2: TF-IDF Vectorization

accuracy_tfidf, report_tfidf, _, conf_tfidf = train_evaluate_model(
    X_train_tfidf,
    X_test_tfidf,
    y_train,
    y_test,
    "RandomForest with TF-IDF"
)



--- Training RandomForest with BOW Model ---
Overall Accuracy (RandomForest with BOW): 0.7495

Classification Report (RandomForest with BOW):
              precision    recall  f1-score   support

    negative       0.76      0.51      0.61       121
     neutral       0.76      0.90      0.82       576
    positive       0.71      0.54      0.61       273

    accuracy                           0.75       970
   macro avg       0.74      0.65      0.68       970
weighted avg       0.75      0.75      0.74       970


Confusion Matrix (RandomForest with BOW):
[[ 62  46  13]
 [ 10 518  48]
 [ 10 116 147]]

--- Training RandomForest with TF-IDF Model ---
Overall Accuracy (RandomForest with TF-IDF): 0.7247

Classification Report (RandomForest with TF-IDF):
              precision    recall  f1-score   support

    negative       0.68      0.42      0.52       121
     neutral       0.74      0.90      0.82       576
    positive       0.68      0.48      0.56       273

    accuracy     

In [53]:
# 7. COMPARISON AND COMMENTARY

print("\n\n" + "="*50)
print("FINAL MODEL PERFORMANCE COMPARISON")
print("="*50)

results = pd.DataFrame({
    'Metric': ['Overall Accuracy'],
    'BOW': [f"{accuracy_bow:.4f}"],
    'TF-IDF': [f"{accuracy_tfidf:.4f}"]
})
print(results.set_index('Metric'))

print("\n--- Detailed Class-Level Comparison (Precision, Recall, F1-Score) ---")

# Extract F1-scores for the three classes
f1_bow = {k: report_bow[k]['f1-score'] for k in ['negative', 'neutral', 'positive']}
f1_tfidf = {k: report_tfidf[k]['f1-score'] for k in ['negative', 'neutral', 'positive']}

f1_comp = pd.DataFrame({
    'Sentiment Class': ['Negative', 'Neutral', 'Positive'],
    'BOW F1-Score': [f"{f1_bow['negative']:.4f}", f"{f1_bow['neutral']:.4f}", f"{f1_bow['positive']:.4f}"],
    'TF-IDF F1-Score': [f"{f1_tfidf['negative']:.4f}", f"{f1_tfidf['neutral']:.4f}", f"{f1_tfidf['positive']:.4f}"]
})
print(f1_comp.set_index('Sentiment Class'))




FINAL MODEL PERFORMANCE COMPARISON
                     BOW  TF-IDF
Metric                          
Overall Accuracy  0.7495  0.7247

--- Detailed Class-Level Comparison (Precision, Recall, F1-Score) ---
                BOW F1-Score TF-IDF F1-Score
Sentiment Class                             
Negative              0.6108          0.5204
Neutral               0.8248          0.8150
Positive              0.6112          0.5641


In [56]:
# 8. OBSERVED DIFFERENCE COMMENTARY (for the Report)
# ==============================================================================

print("Commentary on Observed Differences")
print("""
Commentary on Model Performance:

1.  Nature of Vectorization:
    BOW (CountVectorizer): Assigns weights based purely on word frequency (count). A word that appears 10 times in one headline gets a score of 10, regardless of how common it is across the entire dataset. This can overemphasize very frequent but potentially uninformative words (even after basic stopword removal).
    TF-IDF (TfidfVectorizer): Assigns weights based on Term Frequency (TF) *and* Inverse Document Frequency (IDF). It gives higher scores to words that appear often in a *specific* document (financial headline) but are *rare* across the entire collection.

2.  Expected Performance in Financial News:
    In sentiment analysis, particularly on short texts like financial headlines, TF-IDF often outperforms BOW. This is because sentiment is frequently conveyed by **specific, impactful, and relatively rare words** (e.g., 'plunges', 'soars', 'bankruptcy', 'acquisition').
    TF-IDF effectively down-weights common, non-discriminatory terms (like names of companies or very generic financial terms that appear in most headlines) and elevates the importance of the truly sentiment-bearing terms.

3.  Analysis based on Results:
    (Assuming TF-IDF performs slightly better, which is typical): The marginally higher accuracy of the **TF-IDF model** suggests that the relative importance of words (rarity across all documents) is a more effective feature for classifying financial sentiment than just the raw frequency (BOW).
    Neutral Class:The Neutral class often has the lowest F1-score because it's the most ambiguous. Headlines classified as Neutral may contain subtle positive or negative signals, which are harder for the model to distinguish.

Conclusion:
The Random Forest model trained on TF-IDF features is marginally superior in terms of overall accuracy and often shows better F1-scores for the less frequent classes (Negative/Positive), confirming that weighting schemes that prioritize discriminating power (like IDF) are generally more effective for short-text sentiment classification than simple frequency counts.
""")


Commentary on Observed Differences

Commentary on Model Performance:

1.  Nature of Vectorization:
    BOW (CountVectorizer): Assigns weights based purely on word frequency (count). A word that appears 10 times in one headline gets a score of 10, regardless of how common it is across the entire dataset. This can overemphasize very frequent but potentially uninformative words (even after basic stopword removal).
    TF-IDF (TfidfVectorizer): Assigns weights based on Term Frequency (TF) *and* Inverse Document Frequency (IDF). It gives higher scores to words that appear often in a *specific* document (financial headline) but are *rare* across the entire collection.

2.  Expected Performance in Financial News:
    In sentiment analysis, particularly on short texts like financial headlines, TF-IDF often outperforms BOW. This is because sentiment is frequently conveyed by **specific, impactful, and relatively rare words** (e.g., 'plunges', 'soars', 'bankruptcy', 'acquisition').
    TF-IDF ef