# Codtech - Data Analyst Internship 

**Task 4: Sentiment Analysis**

**Deliverable:** notebook demonstrating sentiment analysis (preprocessing, model, evaluation)

**Dataset:** 10,000 synthetic rows (CSV included)

**Generated on:** 2025-08-10 08:54 UTC

This notebook walks through loading the dataset, basic EDA, text preprocessing, training a TF-IDF + Logistic Regression model, evaluating it, and summarizing brief insights.

In [None]:

# Imports and load dataset
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import make_pipeline

# Path from the original environment; adjust it to wherever the CSV is stored
df = pd.read_csv('/mnt/data/sentiment_dataset_10000.csv')
df.head()


## Quick EDA

In [None]:

print('Shape:', df.shape)
print('\nLabel distribution:')
print(df['sentiment'].value_counts(normalize=True))
df['text_length'] = df['text'].str.len()
print('\nText length summary:')
print(df['text_length'].describe())


## Simple preprocessing function

In [None]:

def preprocess(text):
    # lowercase
    text = text.lower()
    # remove URLs and mentions
    text = re.sub(r'http\S+|www\S+|@\w+', '', text)
    # replace punctuation with spaces; this also strips ASCII emoticons such as :)
    # and keeps only letters, digits, whitespace, and symbols in U+2600-U+26FF (e.g. ☺)
    text = re.sub(r'[^a-z0-9\s\u2600-\u26FF]', ' ', text)
    # collapse whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# fill missing values first so NaN does not become the literal token 'nan'
df['clean_text'] = df['text'].fillna('').astype(str).apply(preprocess)
df.head()
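
A quick sanity check of the cleaning rules can help before training. The sketch below re-declares `preprocess` with the same behavior so that it runs on its own, and applies it to a few made-up strings (the sample texts are hypothetical, not rows from the dataset). Note that ASCII emoticons such as `:(` are removed by the punctuation step, while symbols in the U+2600 to U+26FF range survive.

```python
import re

def preprocess(text):
    # same steps as the notebook cell: lowercase, drop URLs/mentions,
    # replace punctuation with spaces, collapse whitespace
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|@\w+', '', text)
    text = re.sub(r'[^a-z0-9\s\u263a-\u263b\u2600-\u26FF]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Hypothetical sample inputs, for illustration only
samples = [
    "LOVED it!!! Thanks @shop_support https://example.com/order/42",
    "Not good :( would NOT buy again...",
    "Sunny day \u2600 and a smile \u263a",
]

cleaned = [preprocess(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```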


## Train/test split and model pipeline (TF-IDF + Logistic Regression)

In [None]:

X = df['clean_text']
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1,2), max_features=10000),
    LogisticRegression(max_iter=1000)
)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('\nClassification report:')
print(classification_report(y_test, y_pred))
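
To make the TF-IDF step less of a black box, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document. It is a simplification under stated assumptions: it splits on whitespace and uses unigrams only, whereas the pipeline above uses scikit-learn's tokenizer and `ngram_range=(1,2)`. The function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (unigrams, whitespace tokens)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalize; fall back to 1.0 so an empty document does not divide by zero
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

# Toy documents, for illustration only
toy_docs = ["great product great price", "terrible product", "okay price"]
vectors = tfidf(toy_docs)
for doc, vec in zip(toy_docs, vectors):
    print(doc, '->', {t: round(w, 3) for t, w in vec.items()})
```

Terms that are frequent in one document but rare across the corpus (here `great`) end up with the largest weights, which is what Logistic Regression then learns coefficients for.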


## Confusion matrix

In [None]:

# pandas and confusion_matrix are already imported in the first cell
labels = ['positive', 'negative', 'neutral']
cm = confusion_matrix(y_test, y_pred, labels=labels)
# rows are true labels, columns are predicted labels
cm_df = pd.DataFrame(cm, index=labels, columns=[f'{label}_pred' for label in labels])
cm_df
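
The confusion matrix and the classification report are two views of the same counts. As a rough sketch, per-class precision and recall can be derived from a matrix laid out like the one above (rows = true labels, columns = predictions). The helper `per_class_metrics` and the counts in `toy_cm` are hypothetical and are not results from this notebook.

```python
def per_class_metrics(cm, labels):
    """cm[i][j] = number of samples with true label i predicted as label j."""
    metrics = {}
    for k, label in enumerate(labels):
        tp = cm[k][k]
        predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
        actual_k = sum(cm[k])                    # row sum: everything that truly is k
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        metrics[label] = {'precision': precision, 'recall': recall}
    return metrics

labels = ['positive', 'negative', 'neutral']
# Made-up counts, for illustration only
toy_cm = [
    [80, 5, 15],
    [10, 70, 20],
    [10, 25, 65],
]

metrics = per_class_metrics(toy_cm, labels)
accuracy = sum(toy_cm[k][k] for k in range(len(labels))) / sum(map(sum, toy_cm))
for label, m in metrics.items():
    print(label, m)
print('accuracy:', accuracy)
```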


## Brief insights

- This notebook uses a synthetic dataset for learning purposes.
- TF-IDF + Logistic Regression is a fast baseline for sentiment classification.
- For better real-world performance, consider larger real-world datasets, embeddings or transformer models, improved preprocessing, and hyperparameter tuning.
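
As one possible starting point for the hyperparameter tuning mentioned above, the sketch below runs a small grid search over the same kind of pipeline with `GridSearchCV`. The tiny corpus, the grid values, and the `f1_macro` scoring choice are assumptions made so the example runs on its own; in the notebook, `X_train` and `y_train` would take the place of `texts` and `labels`, and a larger `cv` would be more appropriate.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus, for illustration only
texts = [
    "great product love it", "excellent quality very happy",
    "really good value", "love the fast delivery",
    "terrible product hate it", "awful quality very unhappy",
    "really bad value", "hate the slow delivery",
]
labels = ["positive"] * 4 + ["negative"] * 4

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name
param_grid = {
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2)],
    "logisticregression__C": [0.1, 1.0, 10.0],
}

# cv=2 only because the toy corpus is so small
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Best CV macro-F1:", search.best_score_)
```

After fitting, `search` behaves like the best pipeline refit on all the training data, so `search.predict(X_test)` can feed the same evaluation cells as before.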

