# 🧑‍🏫 Task 1 Part 2: Build Your Own Logistic Regression Model for Sentiment Analysis
In this exercise, you will build a **logistic regression model** from scratch to perform sentiment analysis.

**Objective:** Implement all key components of an ML pipeline (except for data handling).

**Allowed Libraries:** `pandas`, `numpy`

**Not Allowed:** Any pre-built ML algorithms or functions like `LogisticRegression` from `sklearn`.

Follow the instructions step-by-step and answer the questions!

## Step 1: Load the Data
**Task:** Use `pandas` to load the dataset from a file named `IMDB_Dataset.csv`.

> **Hint:** Use `pd.read_csv()` to load the file and display the first 5 rows.

**Question:** What are the key features and the target variable in this dataset?

In [1]:
import re
import math

In [2]:
import pandas as pd
import numpy as np

In [3]:
# Load the dataset and display the first few rows
data = pd.read_csv("IMDB_Dataset.csv")
data

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative
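
One way to approach the question above is to look at the column names and the class balance. The sketch below uses a tiny in-memory stand-in for the CSV (the sample rows and the `label` column are made up for illustration); with the real file, the loaded `data` frame could be inspected the same way.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV; these rows are illustrative only
csv_text = """review,sentiment
"A wonderful little production.",positive
"Bad plot, bad dialogue, bad acting.",negative
"I loved every minute of it.",positive
"""
sample = pd.read_csv(io.StringIO(csv_text))

# The feature is the raw review text; the target is the sentiment label
print(sample.columns.tolist())
print(sample["sentiment"].value_counts())

# Encode the target as 1 (positive) / 0 (negative), which is the form
# a logistic regression model is trained on
sample["label"] = (sample["sentiment"] == "positive").astype(int)
print(sample.head())
```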


## Step 3: Tokenization and Text Cleaning
**Task:** Implement your own function to:
1. Convert all text to lowercase.
2. Remove punctuation and special characters.
3. Split the text into words (tokenization).

> **Hint:** Use Python string methods and list comprehensions.

**Question:** Why is tokenization important for text-based models?

In [1]:
# Tokenizer: lowercase the text, replace every non-word character with a
# space, then split on whitespace
def tokenizer(text):
    text = text.lower()
    text = re.sub(r"[^\w]", " ", text)  # punctuation and special characters become spaces
    return text.split()
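
The reviews contain HTML line breaks such as `<br />`, and the word counts in Step 5 show `br` as a very frequent token. A possible variant, sketched here under the assumption that tags should simply be dropped, removes them before the punctuation step. The name `tokenizer_no_html` is hypothetical.

```python
import re

def tokenizer_no_html(text):
    """Hypothetical variant of the tokenizer that drops HTML tags first."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)  # remove tags such as <br />
    text = re.sub(r"[^\w]", " ", text)    # replace punctuation with spaces
    return text.split()

print(tokenizer_no_html("A wonderful little production. <br /><br />The filming..."))
# ['a', 'wonderful', 'little', 'production', 'the', 'filming']
```

The tag pattern is deliberately simple; it would also remove ordinary text that happens to sit between a `<` and a `>`.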

## Step 4: Create a Vocabulary
**Task:** Create a **vocabulary** (a list of unique words) from the tokenized dataset.

> **Hint:** Use a set to store unique words, then convert it to a list.

**Question:** How does vocabulary size affect model performance?

In [7]:
# Tokenize every review; words[i] holds the tokens of review i
words = list()

for review in data["review"]:
    words.append(tokenizer(review))

In [8]:
 
# Flatten the per-review token lists into one list of all tokens
word = [element for innerList in words for element in innerList]

In [24]:
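# Unique words form the vocabulary; the full token list stays in `word`,
# because the word counts in Step 5 need the repeated tokens
vocabulary = list(set(word))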
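
Vocabulary size can be controlled by dropping rare words, and a word-to-index mapping is what later turns each review into a fixed-length feature vector. The sketch below shows one way to do both; `build_vocabulary`, `min_count`, and the toy reviews are assumptions made for illustration.

```python
from collections import Counter

def build_vocabulary(tokenized_reviews, min_count=1):
    """Return the sorted vocabulary and a word -> column index mapping."""
    counts = Counter(token for tokens in tokenized_reviews for token in tokens)
    # Keep only words that occur at least min_count times in the corpus
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    word_to_index = {w: i for i, w in enumerate(vocab)}
    return vocab, word_to_index

toy_reviews = [["good", "movie"], ["bad", "movie"], ["good", "plot"]]
vocab, word_to_index = build_vocabulary(toy_reviews, min_count=2)
print(vocab)          # ['good', 'movie']
print(word_to_index)  # {'good': 0, 'movie': 1}
```

A larger vocabulary gives the model more weights to learn, so a cutoff like `min_count` trades some detail for speed and memory.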

In [None]:
# For every word, count its occurrences in positive and in negative reviews
model = {}
for tokens, sentiment in zip(words, data["sentiment"]):
    key = "pos" if sentiment == "positive" else "neg"
    for token in tokens:
        if token not in model:
            model[token] = {"pos": 0, "neg": 0}
        model[token][key] += 1

## Step 5: Implement Word Count
**Task:** For every review, calculate and store the number of times each word appears in that review.

In [15]:
# Count how many times each token occurs in a list of tokens.
# Per-review counts can be obtained with freq(words[i]).
def freq(tokens):
    word_frequency = {}
    for token in tokens:
        if token not in word_frequency:
            word_frequency[token] = 1   # first occurrence of this word
        else:
            word_frequency[token] += 1  # word seen before: increment its count
    return word_frequency

In [16]:
freq(word)
    

{'one': 53603,
 'of': 289410,
 'the': 667993,
 'other': 18274,
 'reviewers': 493,
 'has': 33038,
 'mentioned': 1079,
 'that': 143879,
 'after': 14984,
 'watching': 9165,
 'just': 35184,
 '1': 4308,
 'oz': 297,
 'episode': 3183,
 'you': 69129,
 'll': 5795,
 'be': 53383,
 'hooked': 284,
 'they': 45383,
 'are': 58387,
 'right': 6529,
 'as': 91750,
 'this': 151002,
 'is': 211082,
 'exactly': 1965,
 'what': 32239,
 'happened': 2054,
 'with': 87368,
 'me': 21457,
 'br': 201951,
 'first': 17583,
 'thing': 9173,
 'struck': 279,
 'about': 34160,
 'was': 95608,
 'its': 16062,
 'brutality': 144,
 'and': 324441,
 'unflinching': 31,
 'scenes': 10482,
 'violence': 2129,
 'which': 23402,
 'set': 4809,
 'in': 186781,
 'from': 40498,
 'word': 1869,
 'go': 9963,
 'trust': 607,
 'not': 60748,
 'a': 322970,
 'show': 12657,
 'for': 87471,
 'faint': 84,
 'hearted': 427,
 'or': 35779,
 'timid': 48,
 'pulls': 373,
 'no': 25292,
 'punches': 127,
 'regards': 140,
 'to': 268124,
 'drugs': 736,
 'sex': 3422,
 'ha
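
The "features" of a review can be built from these counts: one number per vocabulary word, giving how often that word occurs in the review. The sketch below assumes a `word_to_index` mapping from word to column index (a hypothetical name) and ignores words outside the vocabulary.

```python
import numpy as np

def review_to_vector(tokens, word_to_index):
    """Turn one tokenized review into a vector of word counts."""
    vector = np.zeros(len(word_to_index))
    for token in tokens:
        index = word_to_index.get(token)
        if index is not None:  # skip words that are not in the vocabulary
            vector[index] += 1
    return vector

# Toy vocabulary for illustration only
word_to_index = {"bad": 0, "good": 1, "movie": 2}
x = review_to_vector(["good", "good", "movie", "unseen"], word_to_index)
print(x)  # [0. 2. 1.]
```

With the full vocabulary and all 50,000 reviews, a dense matrix of such vectors may be too large for memory, so restricting the vocabulary to frequent words is one option.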

## Step 6: Train-Test Split
**Task:** Split the data into **80% training** and **20% testing** sets.

> **Hint:** Use `numpy` or list slicing to split the data manually.

**Question:** Why do we need to split the data for training and testing?

In [17]:
# Your code here
ratio = 0.80
 
total_rows = data.shape[0]
train_size = int(total_rows*ratio)
 
# Split data into test and train
train = data[0:train_size]
test = data[train_size:]

In [18]:
train

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
39995,This was a marvelously funny comedy with a gre...,positive
39996,There is no plot. There are no central charact...,positive
39997,This show is awesome! I love all the actors! I...,positive
39998,The fact that this movie has been entitled to ...,negative


In [19]:
test

Unnamed: 0,review,sentiment
40000,First off I want to say that I lean liberal on...,negative
40001,I was excited to see a sitcom that would hopef...,negative
40002,When you look at the cover and read stuff abou...,negative
40003,"Like many others, I counted on the appearance ...",negative
40004,"This movie was on t.v the other day, and I did...",negative
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative
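
Slicing keeps the rows in file order. If the file happens to be ordered in some way, the two sets may not be comparable, so a common alternative is to shuffle the row positions first. The sketch below is one way to do that with `numpy`; `shuffled_split`, `seed`, and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def shuffled_split(df, ratio=0.8, seed=0):
    """Split df into train/test parts after shuffling the row order."""
    rng = np.random.default_rng(seed)   # fixed seed makes the split repeatable
    order = rng.permutation(len(df))    # shuffled row positions
    train_size = int(len(df) * ratio)
    train = df.iloc[order[:train_size]]
    test = df.iloc[order[train_size:]]
    return train, test

# Toy frame standing in for the real dataset
toy = pd.DataFrame({
    "review": list("abcdefghij"),
    "sentiment": ["positive", "negative"] * 5,
})
train_toy, test_toy = shuffled_split(toy)
print(len(train_toy), len(test_toy))  # 8 2
```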


## Step 7: Building the Logistic Regression Model (Divided Steps)

### Part 1: The Prediction functions
The **prediction function** returns the model's prediction for a data point. It computes a score as the dot product of the weights and the point's features plus the bias, then applies the sigmoid function to map that score to a value between 0 and 1.

**Task:** Implement the sigmoid and prediction functions.

In [80]:
def sigmoid(x):
    s = 1/(1+ pow(math.e,-x))
    return s

def lr_prediction(weights, bias, features):
    return
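
As a minimal sketch of how the prediction could be filled in: the features are a vector of numbers describing one review (for example, word counts), the weights hold one number per feature, and the bias is a single number added to the score. The example values below are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def lr_prediction(weights, bias, features):
    # Score = weighted sum of the features plus the bias
    score = np.dot(weights, features) + bias
    return sigmoid(score)

# Toy example: two features with opposite weights
weights = np.array([1.0, -1.0])
bias = 0.0
print(lr_prediction(weights, bias, np.array([2.0, 2.0])))  # 0.5
```

A prediction above 0.5 can be read as "positive" and below 0.5 as "negative".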

### Part 2: Implementing the Error functions
**Task:** Implement the log loss for a single data point and the total log loss over the whole dataset.

In [None]:
def log_loss(weights, bias, features, label):
    return

def total_log_loss(weights, bias, X, y):
    return
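
One possible way to fill in the error functions is sketched below. For a point with label y (0 or 1) and prediction p, the log loss is -y·ln(p) - (1 - y)·ln(1 - p); the total is the sum over all points. The clipping constant `eps` is an assumption added to keep the logarithm finite.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_prediction(weights, bias, features):
    return sigmoid(np.dot(weights, features) + bias)

def log_loss(weights, bias, features, label):
    pred = lr_prediction(weights, bias, features)
    eps = 1e-12  # keeps log() away from 0; an illustrative choice
    pred = np.clip(pred, eps, 1 - eps)
    return -label * np.log(pred) - (1 - label) * np.log(1 - pred)

def total_log_loss(weights, bias, X, y):
    # Sum of the per-point losses over the whole dataset
    return sum(log_loss(weights, bias, X[i], y[i]) for i in range(len(X)))

# With all-zero weights every prediction is 0.5, so each point costs ln(2)
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1, 0])
print(total_log_loss(np.zeros(2), 0.0, X, y))  # about 1.386, i.e. 2 * ln(2)
```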

### Part 3: Update Weights
The **update weights** function adjusts the weights and bias based on whether a point is correctly or incorrectly classified. It is a simple method of improving the model at every iteration:
1. **Correctly classified points:** Move the line **away** from the point.
2. **Incorrectly classified points:** Move the line **towards** the point.

**Task:** Implement the gradient update function based on these rules.

In [None]:
# Your code here
def lr_update_weights(weights, bias, features, label, learning_rate=0.01):
    return
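
A minimal sketch of the update, assuming the standard gradient step on the log loss for a single point: the weights move by learning_rate × (label - prediction) × features, and the bias by learning_rate × (label - prediction). The step is large for badly misclassified points and small for points the model already gets right.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_prediction(weights, bias, features):
    return sigmoid(np.dot(weights, features) + bias)

def lr_update_weights(weights, bias, features, label, learning_rate=0.01):
    pred = lr_prediction(weights, bias, features)
    error = label - pred  # positive if the prediction is too low
    new_weights = weights + learning_rate * error * features
    new_bias = bias + learning_rate * error
    return new_weights, new_bias

# Toy example: a positive point seen by an untrained model (prediction 0.5)
w, b = lr_update_weights(np.zeros(2), 0.0, np.array([1.0, 2.0]), 1, learning_rate=0.1)
print(w, b)  # [0.05 0.1 ] 0.05
```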

### Part 4: Implementing the Logistic Regression Algorithm
**Task:** Use the weight-update function to train the logistic regression model over multiple epochs. Keep track of the total error for each epoch. You will later plot these errors.

In [None]:
# Implement the logistic regression model
def lr_algorithm(features, labels, learning_rate=0.01, epochs=200):
    return
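
The training loop could look like the sketch below. It assumes that one epoch means one pass over all training points in random order, starts from zero weights, and records the total log loss after each epoch. The `seed` parameter and the toy data are additions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def total_log_loss(weights, bias, X, y):
    # Clip predictions away from 0 and 1 so the logarithms stay finite
    preds = np.clip(sigmoid(X @ weights + bias), 1e-12, 1 - 1e-12)
    return float(-np.sum(y * np.log(preds) + (1 - y) * np.log(1 - preds)))

def lr_update_weights(weights, bias, features, label, learning_rate=0.01):
    error = label - sigmoid(np.dot(weights, features) + bias)
    return weights + learning_rate * error * features, bias + learning_rate * error

def lr_algorithm(features, labels, learning_rate=0.01, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    weights = np.zeros(features.shape[1])  # one weight per feature
    bias = 0.0
    errors = []
    for _ in range(epochs):
        # Visit every training point once per epoch, in a fresh random order
        for i in rng.permutation(len(features)):
            weights, bias = lr_update_weights(
                weights, bias, features[i], labels[i], learning_rate
            )
        errors.append(total_log_loss(weights, bias, features, labels))
    return weights, bias, errors

# Toy data: the columns are counts of "good" and "bad" (illustrative only)
X = np.array([[2.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 3.0]])
y = np.array([1, 1, 0, 0])
weights, bias, errors = lr_algorithm(X, y, learning_rate=0.1, epochs=200)
print(errors[0], errors[-1])  # the last error should be well below the first
```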

## Step 8: Evaluate Your Model
**Task:** Calculate the accuracy of the model. Compare the predicted labels with the actual labels.

> **Hint:** Use the formula for accuracy: (Correct Predictions / Total Predictions) * 100

**Question:** Which metric—accuracy, precision, or recall—is most important for sentiment analysis?

In [None]:
# Your code here
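
A minimal sketch of the three metrics, assuming labels are encoded as 1 (positive) and 0 (negative) and that predicted probabilities have already been thresholded at 0.5. The function name `evaluate` and the example labels are made up for illustration.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return accuracy (in percent), precision, and recall for 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives
    correct = int(np.sum(y_true == y_pred))
    accuracy = 100.0 * correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
accuracy, precision, recall = evaluate(y_true, y_pred)
print(accuracy)  # 60.0
```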


## Step 8: Visualize the Errors  
**Task:** Create a scatter plot of the total errors over the training epochs. The plot should show a gradual decrease in errors, stabilizing as the model converges.

In [None]:
#Your code here
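
A possible plotting sketch follows. It assumes `matplotlib` is available, which is not in the list of allowed libraries above, so treat it as an assumption; the error values are made up for illustration and would normally come from the training loop.

```python
import matplotlib.pyplot as plt

def plot_errors(errors):
    """Scatter plot of the total error recorded after each epoch."""
    fig, ax = plt.subplots()
    ax.scatter(range(len(errors)), errors, s=10)
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Total log loss")
    ax.set_title("Training error per epoch")
    return ax

# Made-up error values, for illustration only
ax = plot_errors([2.8, 2.1, 1.6, 1.3, 1.1, 1.0])
```

In a notebook the figure is shown when the cell finishes; in a script, `plt.show()` would be needed.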

## Step 9: Make Predictions on New Data
**Task:** Use your trained model to predict the sentiment of the following review:

> _"The movie was absolutely fantastic and kept me hooked till the end."_

**Question:** What challenges might arise when predicting on new data?

In [None]:
# Your code here
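
Predicting on new text repeats the training pipeline: tokenize, build the count vector, then apply the weights and bias. The sketch below uses a hypothetical three-word vocabulary and hand-picked weights purely for illustration; a trained model would supply its own. Words the model has never seen are skipped, which is one of the challenges with new data.

```python
import re
import numpy as np

def tokenizer(text):
    return re.sub(r"[^\w]", " ", text.lower()).split()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_sentiment(text, weights, bias, word_to_index):
    # Build the count vector; out-of-vocabulary words are ignored
    vector = np.zeros(len(word_to_index))
    for token in tokenizer(text):
        index = word_to_index.get(token)
        if index is not None:
            vector[index] += 1
    prob = sigmoid(np.dot(weights, vector) + bias)
    label = "positive" if prob >= 0.5 else "negative"
    return label, prob

# Hypothetical vocabulary and weights, not the result of real training
word_to_index = {"fantastic": 0, "boring": 1, "hooked": 2}
weights = np.array([1.5, -2.0, 0.8])
bias = 0.0

review = "The movie was absolutely fantastic and kept me hooked till the end."
label, prob = predict_sentiment(review, weights, bias, word_to_index)
print(label, prob)
```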


## Step 10: Wrap-up
1. How well did your model perform?
2. What challenges did you face while implementing it from scratch?
3. What improvements would you suggest for the future?

### Notes (if any):
I was not able to figure out the weights, bias, and features. Can you provide some more resources on this?