---

# Sentiment Analysis

### What is Sentiment Analysis?

Sentiment Analysis, also known as opinion mining, is the process of identifying and extracting subjective information from text. It aims to determine the sentiment expressed by the writer, such as positive, negative, or neutral feelings.

**Example use cases:**

* Analyzing customer reviews to understand satisfaction
* Monitoring social media for public opinion on products or events
* Automating feedback categorization for support tickets

---

### Challenges in Sentiment Analysis

1. **Context and Sarcasm:**
   Sentiment can change depending on context or sarcasm, which are hard to detect automatically.

2. **Ambiguity:**
   Words can carry different sentiment depending on usage (e.g., "sick" is negative in a medical context but positive as slang).

3. **Domain Dependence:**
   Sentiment words can vary in meaning by domain (e.g., "unpredictable" might be negative for a car but positive for a thriller movie).

4. **Handling Negations:**
   Negations like "not good" flip sentiment and must be carefully interpreted.

5. **Emoticons and Emojis:**
   They carry emotional meaning and need special handling to be included correctly.
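
Of the challenges above, negation (item 4) lends itself to a simple preprocessing heuristic: tag every token that follows a negation word, up to the next punctuation mark, so that "good" and "good_NOT" become different features. The sketch below is one possible version of that idea, not a complete solution; the `NEGATIONS` set and the `mark_negation` name are assumptions made for illustration. Stop-word lists frequently contain "not" and "no", so marking of this kind would have to run before stop-word removal.

```python
import re

# Hypothetical, deliberately small negation list; extend it for real data.
NEGATIONS = {"not", "no", "never", "cannot"}
# Negation scope is assumed to end at the next punctuation mark.
CLAUSE_END = re.compile(r"[.,;:!?]")

def mark_negation(text):
    """Append _NOT to each token after a negation word, up to the next punctuation."""
    tokens = re.findall(r"[a-z']+|[.,;:!?]", text.lower())
    out, negated = [], False
    for tok in tokens:
        if CLAUSE_END.fullmatch(tok):
            negated = False          # punctuation closes the negation scope
            out.append(tok)
        elif tok in NEGATIONS or tok.endswith("n't"):
            negated = True           # start tagging the tokens that follow
            out.append(tok)
        else:
            out.append(tok + "_NOT" if negated else tok)
    return " ".join(out)

print(mark_negation("The plot was not good, but the acting was great."))
```

Sarcasm and long-range context are not addressed by a rule like this; it only covers the local "not good" pattern.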

---

### Handling Emoticons

Emoticons (like `:)`, `:(`, `:D`) and emojis (😊, 😢) often express emotions explicitly and can improve sentiment detection if correctly interpreted.

**Techniques:**

* **Mapping emoticons to sentiment scores:**
  Replace emoticons with tokens indicating positive/negative sentiment.

* **Using Unicode emoji libraries:**
  Convert emojis into textual descriptions or sentiment scores.

* **Augmenting datasets:**
  Include emoticons as features in training data.
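
The first two techniques can be sketched with the standard library alone. The emoticon table and the `EMO_POS`/`EMO_NEG` token names below are hypothetical placeholders, and the emoji step relies on Unicode character names from `unicodedata` rather than a dedicated emoji library, so it is a rough approximation: it treats every character in category `So` (which also covers symbols such as ©) as an emoji and does not handle multi-codepoint sequences.

```python
import unicodedata

# Hypothetical emoticon lexicon; the tokens are placeholders a vectorizer can count.
EMOTICON_TOKENS = {
    ":-)": "EMO_POS", ":)": "EMO_POS", ":D": "EMO_POS",
    ":-(": "EMO_NEG", ":(": "EMO_NEG",
}

def replace_emoticons(text):
    """Swap known ASCII emoticons for sentiment tokens."""
    # Longest first, so a longer emoticon is never split by a shorter one it contains.
    for emoticon in sorted(EMOTICON_TOKENS, key=len, reverse=True):
        text = text.replace(emoticon, f" {EMOTICON_TOKENS[emoticon]} ")
    return " ".join(text.split())

def describe_emojis(text):
    """Replace each 'So' (Symbol, other) character with its Unicode name."""
    parts = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            parts.append(f" {name.lower().replace(' ', '_')} " if name else " ")
        else:
            parts.append(ch)
    return " ".join("".join(parts).split())

print(replace_emoticons("I love this :)"))
print(describe_emojis("great 😊"))
```

Emoticon replacement has to happen before lowercasing and punctuation stripping, otherwise `:D` and `:)` are destroyed before they can be matched.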

---


In [None]:
import re
import string

import mlflow
import mlflow.sklearn
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# The NLTK 'stopwords' and 'wordnet' corpora must be downloaded once, e.g. with nltk.download()
# import the data 
df = pd.read_csv('IMDB.csv')
df = df.sample(500)
df.to_csv('data.csv', index=False)
df.head()

# data preprocessing

# Define text preprocessing functions
def lemmatization(text):
    """Lemmatize the text."""
    lemmatizer = WordNetLemmatizer()
    text = text.split()
    text = [lemmatizer.lemmatize(word) for word in text]
    return " ".join(text)

def remove_stop_words(text):
    """Remove stop words from the text."""
    stop_words = set(stopwords.words("english"))
    text = [word for word in str(text).split() if word not in stop_words]
    return " ".join(text)

def removing_numbers(text):
    """Remove numbers from the text."""
    text = ''.join([char for char in text if not char.isdigit()])
    return text

def lower_case(text):
    """Convert text to lower case."""
    text = text.split()
    text = [word.lower() for word in text]
    return " ".join(text)

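def removing_punctuations(text):
    """Remove punctuation from the text and collapse repeated whitespace."""
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    text = text.replace('؛', "")  # Arabic semicolon, not covered by string.punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # raw string avoids an invalid-escape warning
    return text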

def removing_urls(text):
    """Remove URLs from the text."""
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

def normalize_text(df):
    """Normalize the text data."""
    try:
        # Remove URLs first: once punctuation is stripped, the URL pattern no longer matches
        df['review'] = df['review'].apply(removing_urls)
        df['review'] = df['review'].apply(lower_case)
        df['review'] = df['review'].apply(remove_stop_words)
        df['review'] = df['review'].apply(removing_numbers)
        df['review'] = df['review'].apply(removing_punctuations)
        df['review'] = df['review'].apply(lemmatization)
        return df
    except Exception as e:
        print(f'Error during text normalization: {e}')
        raise

df = normalize_text(df)
df.head()
df['sentiment'].value_counts()
df
# x = df['sentiment'].isin(['positive','negative'])
# df = df[x]
df['sentiment'] = df['sentiment'].map({'positive':1, 'negative':0})
df.head()
df.isnull().sum()
# change text data into number data
vectorizer = CountVectorizer(max_features=100)
X = vectorizer.fit_transform(df['review'])
y = df['sentiment']
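# Inspect one cleaned review and its count vector by position;
# a fixed index label such as 651 may be missing from the random 500-row sample
df["review"].iloc[1]
X.toarray()[1]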
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# classification problem 
model = LogisticRegression()
model.fit(X_train, y_train)

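# Predicted labels for the held-out reviews
y_pred = model.predict(X_test)

y_pred
accuracy_score(y_test, y_pred)
df
df["review"].iloc[0]  # positional lookup; a label such as 799 may not be in the sample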
# An already-cleaned review used as a prediction example (adjacent literals are concatenated)
rahul = (
    "normally spike lee fan take time really get mojo see clear message ability tell story close heart lee genius "
    "unlike th hour bamboozled two favorite film his clear story film able understand struggle washington choice "
    "play well influenced others odd reason lee never able get true feeling out washington decent job handed him "
    "could tell lee s favorite film lee direct film also wrote it could tell camera work horrid writing contributed "
    "decay film film coming full circle going pretty lee behind film right thing film seen lee direct brightest "
    "modest film almost created hollywood movie instead one own know saw money right thing ran it film demonstrate "
    "true talent br br for anyone seen film perhaps stopped watching anything directed spike lee afterwards due film "
    "suggest give second chance get wrong see exactly coming film would want put behind you lee grow up work becomes "
    "own see transformation desire make money wanting make good film took awhile watch th hour did sheer brilliance "
    "perhaps actor perhaps story lee crafted amazing film one man s journey unknown guess hoping mo better blue would "
    "turn be really dark journey life man really never grew up instead got denzel denzel really one versatile actor "
    "generation consider sydney poitier cinema film showcase talent br br another issue film use spike s sister "
    "playing one love interest know you family think could filmed sex scene sister care actor much money getting paid "
    "would never it something never wish see apparently different spike went ahead showed full nude image sister "
    "without remorse sad even made blush also need somebody answer this flavor flav introducing film so sitting couch "
    "ready start film suddenly voice past spelling studio made film acknowledges himself build strong remaining story "
    "again felt lee going money film instead actual talent perhaps could afford denzel wesley movie without explosion "
    "br br there two great scene film made worth watching end get wrong bad movie always diamond every alleyway "
    "scene bleek accidentally forgets woman mesmerizing continually went back forth weaving truth confusion way "
    "proved lee actually behind camera visionary scene probably lost shuffle due remaining poor scene scene worth "
    "watching way lee introduced ended film keeping pacing direction able bring tragic character around full circle "
    "give chance change life two moment rest film pure rubbish worth viewing unless go blind br br grade"
)
# 2. Transform your input

df1 = pd.DataFrame({'review': [rahul]})

df1 = normalize_text(df1)  # Apply the same preprocessing pipeline used for training
df1
# Reuse the fitted vectorizer (transform, not fit_transform) so the columns match the training data
X_input = vectorizer.transform(df1['review'])

X_input.toarray()
X_input.shape
# 3. Predict using your trained model
prediction = model.predict(X_input)

if prediction[0]==0:
    print(rahul ,"Negative review")
else:
    print(rahul ,"Positive review")


import logging
import os
import time

import dagshub
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Point MLflow at the DagsHub-hosted tracking server for this repository
mlflow.set_tracking_uri('https://dagshub.com/vikashdas770/YT-Capstone-Project.mlflow')
dagshub.init(repo_owner='vikashdas770', repo_name='YT-Capstone-Project', mlflow=True)

mlflow.set_experiment("Logistic Regression Baseline")

# Configure logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

logging.info("Starting MLflow run...")

with mlflow.start_run():
    start_time = time.time()
    
    try:
        logging.info("Logging preprocessing parameters...")
        mlflow.log_param("vectorizer", "Bag of Words")
        mlflow.log_param("num_features", 100)
        mlflow.log_param("test_size", 0.25)

        logging.info("Initializing Logistic Regression model...")
        model = LogisticRegression(max_iter=1000)  # Increase max_iter to prevent non-convergence issues

        logging.info("Fitting the model...")
        model.fit(X_train, y_train)
        logging.info("Model training complete.")

        logging.info("Logging model parameters...")
        mlflow.log_param("model", "Logistic Regression")

        logging.info("Making predictions...")
        y_pred = model.predict(X_test)

        logging.info("Calculating evaluation metrics...")
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)

        logging.info("Logging evaluation metrics...")
        mlflow.log_metric("accuracy", accuracy)
        mlflow.log_metric("precision", precision)
        mlflow.log_metric("recall", recall)
        mlflow.log_metric("f1_score", f1)

        logging.info("Saving and logging the model...")
        mlflow.sklearn.log_model(model, "model")

        # Log execution time
        end_time = time.time()
        logging.info(f"Model training and logging completed in {end_time - start_time:.2f} seconds.")

        # Save and log the notebook
        # notebook_path = "exp1_baseline_model.ipynb"
        # logging.info("Executing Jupyter Notebook. This may take a while...")
        # os.system(f"jupyter nbconvert --to notebook --execute --inplace {notebook_path}")
        # mlflow.log_artifact(notebook_path)

        # logging.info("Notebook execution and logging complete.")

        # Print the results for verification
        logging.info(f"Accuracy: {accuracy}")
        logging.info(f"Precision: {precision}")
        logging.info(f"Recall: {recall}")
        logging.info(f"F1 Score: {f1}")

    except Exception as e:
        logging.error(f"An error occurred: {e}", exc_info=True)



In [None]:
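## Naive Bayes Classifier
from sklearn.naive_bayes import GaussianNB

# GaussianNB requires dense arrays, so convert the sparse CountVectorizer matrices
X_train_dense = X_train.toarray()
X_test_dense = X_test.toarray()

gnb = GaussianNB()
%time gnb.fit(X_train_dense, y_train)

y_pred_train = gnb.predict(X_train_dense)
y_pred_test = gnb.predict(X_test_dense)
print("\nTraining Accuracy score:", accuracy_score(y_train, y_pred_train))
print("Testing Accuracy score:", accuracy_score(y_test, y_pred_test))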

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_test, target_names=['negative', 'positive']))

In [None]:
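import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Rows are true labels and columns are predicted labels, ordered 0 (negative) then 1 (positive)
cm = confusion_matrix(y_test, y_pred_test)

cm_matrix = pd.DataFrame(data=cm, columns=['Predicted Negative', 'Predicted Positive'],
                         index=['Actual Negative', 'Actual Positive'])
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')
plt.show()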

In [None]:
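import matplotlib.pyplot as plt
from sklearn import metrics

probs = gnb.predict_proba(X_test.toarray())  # GaussianNB requires dense input
preds = probs[:, 1]  # predicted probability of the positive class
fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
roc_auc = metrics.auc(fpr, tpr)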

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
# Hard-voting ensemble; MultinomialNB or LinearSVC could be added as further estimators
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier

# Base estimators (VotingClassifier fits its own clones of these)
dt = DecisionTreeClassifier(random_state=42)
lr = LogisticRegression(max_iter=1000)

classifiers = [('Decision Tree', dt),
               ('Logistic Regression', lr),
               ('Naive Bayes', gnb)]
vc = VotingClassifier(estimators=classifiers)
# Fit 'vc' to the training set and predict test set labels (dense input because of GaussianNB)
vc.fit(X_train.toarray(), y_train)
y_pred_train = vc.predict(X_train.toarray())
y_pred_test = vc.predict(X_test.toarray())
print("Training Accuracy score:", accuracy_score(y_train, y_pred_train))
print("Testing Accuracy score:", accuracy_score(y_test, y_pred_test))

# Sentiment Analysis with ANN (Artificial Neural Network)


### Step 1: Data Preparation

* Collect a labeled dataset (e.g., movie reviews with positive/negative labels).
* Clean the text (lowercase, remove noise).
* Handle emoticons by replacing them with sentiment tokens.
* Convert text to numeric vectors (TF-IDF, CountVectorizer, or embeddings).
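
To make the last bullet concrete, here is a minimal pure-Python sketch of what a count-based vectorizer does: build a vocabulary from the corpus, then count how often each vocabulary word appears in a document. The function names and the three toy documents are made up for illustration; in practice `CountVectorizer` or `TfidfVectorizer` would do this work, with extra tokenization and weighting on top.

```python
from collections import Counter

def build_vocabulary(docs, max_features=None):
    """Map the most frequent tokens to column indices."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    # Descending frequency, then alphabetical, so the ordering is deterministic.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return {tok: i for i, tok in enumerate(ranked[:max_features])}

def vectorize(doc, vocab):
    """Turn one document into a list of per-word counts."""
    vec = [0] * len(vocab)
    for tok in doc.split():
        if tok in vocab:             # out-of-vocabulary words are ignored
            vec[vocab[tok]] += 1
    return vec

docs = ["good movie", "bad movie", "good good acting"]
vocab = build_vocabulary(docs)
print(vocab)
print([vectorize(d, vocab) for d in docs])
```

The vocabulary is built once from the training texts and then reused unchanged for any new input, which is why new reviews are passed through `transform` rather than `fit_transform`.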

---

### Step 2: ANN Model Architecture

* Input layer size = number of features
* Hidden layers (e.g., 1 or 2 layers with ReLU activation)
* Output layer with 1 neuron and sigmoid activation (for binary sentiment classification)
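
A quick way to sanity-check an architecture like this is to count its trainable parameters: each fully connected layer has one weight per input-output pair plus one bias per output unit. The helper below is a small illustrative sketch (the name `dense_param_counts` is hypothetical), applied to an assumed network with 100 input features and hidden layers of 16 and 8 units.

```python
def dense_param_counts(layer_sizes):
    """Per-layer parameter counts for a stack of fully connected layers.

    layer_sizes lists the input size followed by each layer's unit count.
    """
    return [n_in * n_out + n_out            # weights + biases
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

sizes = [100, 16, 8, 1]                     # input, two hidden layers, output
per_layer = dense_param_counts(sizes)
print(per_layer, sum(per_layer))            # [1616, 136, 9] 1761
```

Almost all parameters sit in the first layer, so the size of the vocabulary drives the size of the model.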



### Step 3: Code Example (Using Keras)

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# Sample data
texts = [
    "I love this product! :)",
    "This is the worst movie I've ever seen :(",
    "Absolutely fantastic experience!",
    "I do not like this at all :/",
    "Such a boring day..."
]
labels = [1, 0, 1, 0, 0]  # 1 = positive, 0 = negative

# Preprocessing function to handle emoticons
def preprocess_text(text):
    emoticon_dict = {
        ':)': ' happy ',
        ':(': ' sad ',
        ':D': ' happy ',
        ':/': ' disappointed ',
        ':-)': ' happy ',
        ':-(': ' sad '
    }
    for emot in emoticon_dict:
        text = text.replace(emot, emoticon_dict[emot])
    return text.lower()

texts = [preprocess_text(t) for t in texts]

# Vectorize text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts).toarray()
y = np.array(labels)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define ANN model
model = Sequential([
    Dense(16, input_dim=X.shape[1], activation='relu'),
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])

# Train model
model.fit(X_train, y_train, epochs=20, batch_size=2, verbose=1)

# Evaluate model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy*100:.2f}%")
```

---

### Explanation

* We replace emoticons with text tokens that the vectorizer can pick up.
* We use TF-IDF to convert text to numeric vectors.
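* The ANN has two hidden layers (16 and 8 units, both with ReLU activation).
* The output layer uses sigmoid activation to output a probability for positive sentiment.
* We train and evaluate the model on a five-sentence toy dataset, so the test split holds a single example and the reported accuracy (either 0% or 100%) only shows that the pipeline runs.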
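
The sigmoid output still has to be turned into a label. A minimal sketch of that last step, assuming the common 0.5 cut-off (the threshold and the function names are illustrative choices, not something the model fixes):

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)                  # avoids overflow for large negative z
    return e / (1.0 + e)

def to_label(probability, threshold=0.5):
    """Map a positive-class probability to a sentiment label."""
    return "positive" if probability >= threshold else "negative"

for z in (-2.0, 0.0, 2.0):
    p = sigmoid(z)
    print(f"z={z:+.1f}  p={p:.3f}  ->  {to_label(p)}")
```

Raising the threshold trades recall on positive reviews for precision, which can be useful when false positives are more costly.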

---

### Assignment Ideas

1. Collect a larger dataset of social media comments.
2. Add more emoticon handling rules.
3. Try word embeddings (Word2Vec, GloVe) instead of TF-IDF.
4. Experiment with different ANN architectures (more layers, dropout).
5. Evaluate the model on a test set and analyze errors.
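
For idea 5, a possible starting point is to compute the basic metrics by hand and list the misclassified texts. The sketch below uses only the standard library; the function names and the four example sentences with their predictions are invented for illustration and are not model output.

```python
def binary_report(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

def misclassified(texts, y_true, y_pred):
    """Return (text, true label, predicted label) for every wrong prediction."""
    return [(x, t, p) for x, t, p in zip(texts, y_true, y_pred) if t != p]

# Made-up example data
texts = ["loved it", "not good", "great cast", "boring"]
y_true = [1, 0, 1, 0]
y_pred = [1, 1, 0, 0]
print(binary_report(y_true, y_pred))
print(misclassified(texts, y_true, y_pred))
```

Reading the misclassified examples often shows which of the challenges listed earlier (negation, sarcasm, domain vocabulary) the model is struggling with.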