# DATA ANALYSIS AND DATA SCIENCE WITH PYTHON
## Task 5 – Classification Tasks Overview

This notebook contains the complete implementation of **Task 5**:

1. **Student Pass/Fail Prediction** using Logistic Regression
2. **Sentiment Analysis with Natural Language Processing (NLP)** using Logistic Regression

Each task is worked through step by step: dataset, exploration or preprocessing, model training, evaluation, and insights.

## Table of Contents
1. [Setup and Common Imports](#setup)
2. [Task 1 – Student Pass/Fail Prediction](#task1)
    1. [Objective](#task1_objective)
    2. [Dataset Creation](#task1_dataset)
    3. [Data Exploration](#task1_explore)
    4. [Feature Selection and Train–Test Split](#task1_split)
    5. [Model Training – Logistic Regression](#task1_model)
    6. [Model Evaluation](#task1_evaluation)
    7. [Insights](#task1_insights)
3. [Task 2 – Sentiment Analysis with NLP](#task2)
    1. [Objective](#task2_objective)
    2. [Dataset Creation / Loading](#task2_dataset)
    3. [Text Preprocessing](#task2_preprocess)
    4. [Text Vectorization (TF‑IDF)](#task2_vectorize)
    5. [Train–Test Split and Model Training](#task2_model)
    6. [Model Evaluation](#task2_evaluation)
    7. [Insights and Example Predictions](#task2_insights)


## 1. Setup and Common Imports <a id='setup'></a>

Here we import the libraries that will be used in both classification tasks.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# set_theme is the current name; sns.set is kept only as an alias
sns.set_theme(style="whitegrid", context="notebook")

---
## 2. Task 1 – Student Pass/Fail Prediction <a id='task1'></a>


### 2.1 Objective <a id='task1_objective'></a>

Predict whether a student will **pass (1)** or **fail (0)** using:

- **Study Hours**: Number of hours the student studies per week
- **Attendance**: Percentage of classes attended

We will use a **Logistic Regression** classification model.

### 2.2 Dataset Creation <a id='task1_dataset'></a>

For demonstration, we create a small synthetic dataset of 30 students whose labels follow a simple rule:
- Students with **low study hours and low attendance** tend to fail.
- Students with **high study hours and high attendance** tend to pass.

In [None]:
# Create a synthetic dataset for student performance
data_student = {
    'Study_Hours': [5, 8, 10, 12, 15, 18, 20, 22, 25, 28, 30, 32, 35, 38, 40,
                    6, 9, 11, 14, 16, 19, 21, 23, 26, 29, 31, 33, 36, 39, 7],
    'Attendance': [40, 50, 55, 60, 65, 70, 75, 78, 80, 82, 85, 88, 90, 92, 95,
                   42, 52, 58, 63, 68, 72, 76, 79, 83, 86, 89, 91, 93, 96, 45]
}

df_student = pd.DataFrame(data_student)

# Define a simple rule to label pass/fail for illustration
# If (Study_Hours >= 15 and Attendance >= 70) -> Pass (1), else Fail (0)
df_student['Pass'] = ((df_student['Study_Hours'] >= 15) & (df_student['Attendance'] >= 70)).astype(int)

df_student.head()

### 2.3 Data Exploration <a id='task1_explore'></a>

We will:
- Check for missing values
- View summary statistics
- Visualize the relationship between features and target

In [None]:
# Check for missing values
df_student.isnull().sum()

In [None]:
# Summary statistics
df_student.describe()

In [None]:
# Visualize Study Hours vs Attendance colored by Pass/Fail
plt.figure(figsize=(7, 5))
sns.scatterplot(data=df_student, x='Study_Hours', y='Attendance', hue='Pass', s=80)
plt.title('Study Hours vs Attendance (Colored by Pass/Fail)')
plt.xlabel('Study Hours per Week')
plt.ylabel('Attendance (%)')
plt.legend(title='Pass')
plt.tight_layout()
plt.show()
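
Two further checks are useful before modelling: how balanced the classes are, and how strongly the two features move together. The sketch below rebuilds the same dataset so that it runs on its own; `class_counts` and `feature_corr` are names introduced here for illustration.

```python
import pandas as pd

# Rebuild the synthetic student dataset from section 2.2
study_hours = [5, 8, 10, 12, 15, 18, 20, 22, 25, 28, 30, 32, 35, 38, 40,
               6, 9, 11, 14, 16, 19, 21, 23, 26, 29, 31, 33, 36, 39, 7]
attendance = [40, 50, 55, 60, 65, 70, 75, 78, 80, 82, 85, 88, 90, 92, 95,
              42, 52, 58, 63, 68, 72, 76, 79, 83, 86, 89, 91, 93, 96, 45]
df = pd.DataFrame({'Study_Hours': study_hours, 'Attendance': attendance})
df['Pass'] = ((df['Study_Hours'] >= 15) & (df['Attendance'] >= 70)).astype(int)

# Class balance: number of fail (0) and pass (1) rows
class_counts = df['Pass'].value_counts().sort_index()
print(class_counts)

# Pearson correlation between the two features
feature_corr = df['Study_Hours'].corr(df['Attendance'])
print(f"Correlation between Study_Hours and Attendance: {feature_corr:.2f}")
```

The labelling rule yields 11 failing and 19 passing students. A high correlation means the two features carry largely the same information, so the individual model coefficients should be interpreted with care.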

### 2.4 Feature Selection and Train–Test Split <a id='task1_split'></a>

We use **Study_Hours** and **Attendance** as features (X) and **Pass** as the target (y).

In [None]:
# Feature matrix and target vector
X_student = df_student[['Study_Hours', 'Attendance']]
y_student = df_student['Pass']

# Train-test split (80% training, 20% testing)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_student, y_student, test_size=0.2, random_state=42
)

X_train_s.shape, X_test_s.shape
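
With only 30 rows, a purely random split can leave the test set with an uneven mix of pass and fail cases. One possible variant, shown here as a sketch rather than as the notebook's method, passes `stratify` to `train_test_split` so that both subsets keep roughly the original class proportions. The dataset is rebuilt so the block runs on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Rebuild the synthetic student dataset from section 2.2
study_hours = [5, 8, 10, 12, 15, 18, 20, 22, 25, 28, 30, 32, 35, 38, 40,
               6, 9, 11, 14, 16, 19, 21, 23, 26, 29, 31, 33, 36, 39, 7]
attendance = [40, 50, 55, 60, 65, 70, 75, 78, 80, 82, 85, 88, 90, 92, 95,
              42, 52, 58, 63, 68, 72, 76, 79, 83, 86, 89, 91, 93, 96, 45]
df = pd.DataFrame({'Study_Hours': study_hours, 'Attendance': attendance})
df['Pass'] = ((df['Study_Hours'] >= 15) & (df['Attendance'] >= 70)).astype(int)

X = df[['Study_Hours', 'Attendance']]
y = df['Pass']

# stratify=y keeps the pass/fail ratio similar in the train and test subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train class counts:\n", y_train.value_counts().sort_index())
print("Test class counts:\n", y_test.value_counts().sort_index())
```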

### 2.5 Model Training – Logistic Regression <a id='task1_model'></a>

We train a **Logistic Regression** model on the training set.

In [None]:
# Initialize and train the Logistic Regression model
log_reg_student = LogisticRegression()
log_reg_student.fit(X_train_s, y_train_s)

# Predict on the test set
y_pred_s = log_reg_student.predict(X_test_s)
y_pred_s
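
To make the model less of a black box, the following sketch reproduces its predicted probabilities by hand: Logistic Regression computes a linear score `z = w · x + b` and passes it through the sigmoid function. The four-row dataset is a hypothetical stand-in chosen only to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: [Study_Hours, Attendance] -> Pass (1) / Fail (0)
X_toy = np.array([[5, 40], [10, 55], [20, 75], [30, 85]], dtype=float)
y_toy = np.array([0, 0, 1, 1])

model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Linear score for each row, then the sigmoid of that score
z = X_toy @ model.coef_.ravel() + model.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn
p_sklearn = model.predict_proba(X_toy)[:, 1]

print("Manual probabilities: ", np.round(p_manual, 3))
print("sklearn probabilities:", np.round(p_sklearn, 3))
print("Match:", np.allclose(p_manual, p_sklearn))
```

`predict` returns class 1 whenever this probability is above 0.5, which is the same as the linear score being positive.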

### 2.6 Model Evaluation <a id='task1_evaluation'></a>

We evaluate the model using:
- **Accuracy**
- **Confusion Matrix**
- **Classification Report** (precision, recall, F1‑score)

In [None]:
# Accuracy
accuracy_s = accuracy_score(y_test_s, y_pred_s)
print(f"Accuracy: {accuracy_s:.2f}")

# Confusion matrix
cm_s = confusion_matrix(y_test_s, y_pred_s)
print("\nConfusion Matrix:\n", cm_s)

# Visualize confusion matrix
plt.figure(figsize=(4, 3))
sns.heatmap(cm_s, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Predicted Fail', 'Predicted Pass'],
            yticklabels=['Actual Fail', 'Actual Pass'])
plt.title('Confusion Matrix – Student Pass/Fail')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.show()

# Detailed classification report
print("\nClassification Report:\n")
print(classification_report(y_test_s, y_pred_s))
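
The figures in the classification report can be derived directly from the four cells of the confusion matrix. The sketch below does this for a small set of hypothetical labels (`y_true_demo` and `y_pred_demo` are made up for illustration) and checks the result against scikit-learn's own metric functions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical true and predicted labels (1 = pass, 0 = fail)
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0]

# For binary labels ordered [0, 1], ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

precision = tp / (tp + fp)   # of the predicted passes, how many were correct
recall = tp / (tp + fn)      # of the actual passes, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
print("sklearn:", precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      f1_score(y_true_demo, y_pred_demo))
```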

### 2.7 Insights <a id='task1_insights'></a>

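- Students with **higher study hours** and **higher attendance** have a higher predicted probability of passing, which mirrors the rule used to generate the labels (Study_Hours >= 15 and Attendance >= 70).
- The scatter plot shows that passing and failing students occupy distinct regions of the feature space, so a linear decision boundary can separate them well.
- The accuracy, confusion matrix, and classification report are computed on only 6 held-out students (20% of 30), so they give a rough indication of generalization rather than a precise estimate.
- In a real project, we could further improve the model by collecting more data and testing other algorithms.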
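
Because the single test split in this section holds only 6 of the 30 students, one accuracy figure can swing a lot with the choice of `random_state`. A possible complement, sketched here under the assumption that 5 stratified folds are acceptable for a dataset this small, is k-fold cross-validation, which averages accuracy over several splits.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Rebuild the synthetic student dataset from section 2.2
study_hours = [5, 8, 10, 12, 15, 18, 20, 22, 25, 28, 30, 32, 35, 38, 40,
               6, 9, 11, 14, 16, 19, 21, 23, 26, 29, 31, 33, 36, 39, 7]
attendance = [40, 50, 55, 60, 65, 70, 75, 78, 80, 82, 85, 88, 90, 92, 95,
              42, 52, 58, 63, 68, 72, 76, 79, 83, 86, 89, 91, 93, 96, 45]
df = pd.DataFrame({'Study_Hours': study_hours, 'Attendance': attendance})
df['Pass'] = ((df['Study_Hours'] >= 15) & (df['Attendance'] >= 70)).astype(int)

X = df[['Study_Hours', 'Attendance']]
y = df['Pass']

# 5 stratified folds: each fold keeps a similar pass/fail ratio
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring='accuracy')

print("Fold accuracies:", scores.round(2))
print(f"Mean accuracy: {scores.mean():.2f} (std {scores.std():.2f})")
```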

---
## 3. Task 2 – Sentiment Analysis with Natural Language Processing <a id='task2'></a>


### 3.1 Objective <a id='task2_objective'></a>

Analyze customer reviews and classify the **sentiment** as **positive** or **negative** using:

- Text preprocessing (cleaning, lowercasing, removing stopwords)
- **TF‑IDF vectorization** for converting text to numerical features
- **Logistic Regression** for classification
- Evaluation using **Accuracy, Precision, Recall, and F1‑Score**

### 3.2 Dataset Creation / Loading <a id='task2_dataset'></a>

In practice, you would load a dataset such as `reviews.csv` with columns:
- `Review_Text`
- `Sentiment`

Here, for demonstration, we create a small sample dataset inline.

In [None]:
# Sample customer reviews dataset
data_reviews = {
    'Review_Text': [
        "Amazing product, works great!",
        "Very disappointing, waste of money.",
        "I love it, highly recommended!",
        "Terrible experience, will not buy again.",
        "I am happy with the quality and service.",
        "Not worth buying, very poor performance.",
        "Excellent build and great performance.",
        "Really bad customer support.",
        "Good value for money.",
        "Worst product I have ever used."
    ],
    'Sentiment': [
        'positive', 'negative', 'positive', 'negative', 'positive',
        'negative', 'positive', 'negative', 'positive', 'negative'
    ]
}

df_reviews = pd.DataFrame(data_reviews)
df_reviews.head()

### 3.3 Text Preprocessing <a id='task2_preprocess'></a>

Preprocessing steps:
- Remove punctuation and special characters
- Convert text to lowercase
- Remove stopwords
- (Optionally) perform stemming or lemmatization


In [None]:
import re
import nltk
from nltk.corpus import stopwords

# Download stopwords (run once; comment out if already downloaded)
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def clean_text(text: str) -> str:
    """Clean the input text by removing non-letters, lowercasing, and removing stopwords."""
    # Keep only letters
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # Lowercase
    text = text.lower()
    # Tokenize
    tokens = text.split()
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # Join back to string
    return " ".join(tokens)

# Apply cleaning function
df_reviews['Clean_Review'] = df_reviews['Review_Text'].apply(clean_text)
df_reviews
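
One side effect of stopword removal is worth seeing in isolation: negation words such as *not* are in NLTK's English stopword list, so they disappear during cleaning. The sketch below uses a small hand-written stopword set (`DEMO_STOPWORDS`, an assumption standing in for NLTK's full list so the block needs no download) and shows a variant that keeps negations.

```python
import re

# Small illustrative stopword set; NOT the full NLTK list
DEMO_STOPWORDS = {"i", "am", "the", "is", "a", "of", "and",
                  "very", "will", "not", "too", "but"}
# Negation words we may want to keep for sentiment analysis
NEGATIONS = {"not", "no", "nor", "never"}

def clean_text_demo(text: str, stop_words: set) -> str:
    """Keep letters only, lowercase, and drop the given stopwords."""
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    return " ".join(w for w in text.split() if w not in stop_words)

review = "Not worth buying, very poor performance."

default_clean = clean_text_demo(review, DEMO_STOPWORDS)
negation_clean = clean_text_demo(review, DEMO_STOPWORDS - NEGATIONS)

print("Default:        ", default_clean)
print("Keeps negations:", negation_clean)
```

With the notebook's own `stop_words` set, the same idea would amount to subtracting the negation words before calling `clean_text`. Whether this helps depends on the data, since a single-word feature for *not* still does not capture what is being negated.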

### 3.4 Text Vectorization (TF‑IDF) <a id='task2_vectorize'></a>

We convert the cleaned text into numerical form using **TF‑IDF (Term Frequency–Inverse Document Frequency)**.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF vectorizer
tfidf = TfidfVectorizer()

# Fit and transform the clean reviews
X_reviews = tfidf.fit_transform(df_reviews['Clean_Review'])

# Encode sentiment labels: positive -> 1, negative -> 0
y_reviews = df_reviews['Sentiment'].map({'positive': 1, 'negative': 0})

X_reviews.shape, y_reviews.shape
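
To show what the vectorizer computes, the sketch below rebuilds the TF‑IDF matrix by hand for a three-document toy corpus (hypothetical text, not the review data). With its default settings, `TfidfVectorizer` multiplies raw term counts by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus
docs = ["great product", "bad product", "great great value"]

# Reference result from scikit-learn (default settings)
tfidf_demo = TfidfVectorizer()
X_demo = tfidf_demo.fit_transform(docs).toarray()

# Raw term counts, using the same vocabulary so the columns line up
counts = CountVectorizer(vocabulary=tfidf_demo.vocabulary_).fit_transform(docs).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)            # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed inverse document frequency

manual = counts * idf                           # tf * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalize rows

print("Vocabulary:", sorted(tfidf_demo.vocabulary_, key=tfidf_demo.vocabulary_.get))
print("Manual matches sklearn:", np.allclose(manual, X_demo))
```

Terms that occur in fewer documents (here *bad* and *value*) receive a larger IDF than terms shared by several documents (*great*, *product*).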

### 3.5 Train–Test Split and Model Training <a id='task2_model'></a>

We split the data into training and testing sets and train a **Logistic Regression** classifier.

In [None]:
# Train-test split (80% training, 20% testing)
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X_reviews, y_reviews, test_size=0.2, random_state=42
)

# Initialize and train the Logistic Regression model
log_reg_reviews = LogisticRegression(max_iter=1000)
log_reg_reviews.fit(X_train_r, y_train_r)

# Predictions
y_pred_r = log_reg_reviews.predict(X_test_r)
y_pred_r
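
In the cells above, the TF‑IDF vectorizer is fitted on all reviews before the split, so the vocabulary and IDF weights are influenced by the test reviews. A possible alternative, sketched here and not the notebook's own method, wraps the vectorizer and classifier in a scikit-learn pipeline so that the vectorizer is fitted on the training text only. To stay self-contained, the sketch feeds raw text to `TfidfVectorizer` (which lowercases and tokenizes by default) and skips the NLTK cleaning step; `sentiment_pipeline` is a name introduced for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Same sample reviews as in section 3.2 (1 = positive, 0 = negative)
texts = [
    "Amazing product, works great!", "Very disappointing, waste of money.",
    "I love it, highly recommended!", "Terrible experience, will not buy again.",
    "I am happy with the quality and service.", "Not worth buying, very poor performance.",
    "Excellent build and great performance.", "Really bad customer support.",
    "Good value for money.", "Worst product I have ever used.",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text first, keeping one review of each class in the test set
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The vectorizer inside the pipeline only ever sees the training text during fit
sentiment_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
sentiment_pipeline.fit(texts_train, y_train)

preds = sentiment_pipeline.predict(texts_test)
for text, pred in zip(texts_test, preds):
    print(f"{text!r} -> {'positive' if pred == 1 else 'negative'}")
```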

### 3.6 Model Evaluation <a id='task2_evaluation'></a>

We evaluate the sentiment analysis model using:
- **Accuracy**
- **Precision**
- **Recall**
- **F1‑Score**

In [None]:
accuracy_r = accuracy_score(y_test_r, y_pred_r)
print(f"Accuracy: {accuracy_r:.2f}")

print("\nClassification Report:\n")
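# labels and zero_division keep the report stable when the tiny test set
# or the predictions contain only one class
print(classification_report(
    y_test_r, y_pred_r, labels=[0, 1],
    target_names=['negative', 'positive'], zero_division=0
))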

### 3.7 Insights and Example Predictions <a id='task2_insights'></a>

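- **Positive reviews** in this sample contain words such as: *amazing, love, excellent, happy, good, value*.
- **Negative reviews** contain words such as: *disappointing, waste, terrible, worst, bad, poor*. Note that *not* is in NLTK's English stopword list, so `clean_text` removes it and negation is lost.
- TF‑IDF weights each word by how often it appears in a review and how rare it is across all reviews. It does not use the labels; the Logistic Regression coefficients are what indicate which words push a prediction toward positive or negative.
- Logistic Regression is a simple and interpretable baseline model for sentiment analysis. With only 10 reviews, the scores above are illustrative rather than reliable.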
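
The words that drive the classifier can be read off its coefficients. The sketch below refits a model on all 10 sample reviews, using raw text without the NLTK cleaning step so that it runs on its own, and lists the words with the most negative and most positive weights. `word_coef` is a helper name introduced here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Same sample reviews as in section 3.2 (1 = positive, 0 = negative)
texts = [
    "Amazing product, works great!", "Very disappointing, waste of money.",
    "I love it, highly recommended!", "Terrible experience, will not buy again.",
    "I am happy with the quality and service.", "Not worth buying, very poor performance.",
    "Excellent build and great performance.", "Really bad customer support.",
    "Good value for money.", "Worst product I have ever used.",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression(max_iter=1000).fit(X, labels)

# Map each word to its coefficient (positive pushes toward class 1)
coefs = model.coef_.ravel()
word_coef = {word: coefs[idx] for word, idx in vectorizer.vocabulary_.items()}

ranked = sorted(word_coef, key=word_coef.get)
print("Most negative words:", ranked[:5])
print("Most positive words:", ranked[-5:][::-1])
```

Because stopwords are not removed in this sketch, common words that happen to occur only in one class may also rank highly, which is an artifact of the very small sample.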

Below we test the model with a few custom review examples.

In [None]:
# Helper function to predict sentiment for new reviews
def predict_sentiment(review_text: str) -> str:
    clean = clean_text(review_text)
    vec = tfidf.transform([clean])
    pred = log_reg_reviews.predict(vec)[0]
    return 'positive' if pred == 1 else 'negative'

sample_reviews = [
    "The product quality is outstanding and I am very satisfied.",
    "This is a total waste of money and time.",
    "Average experience, nothing special but not too bad either."
]

for r in sample_reviews:
    print(f"Review: {r}\nPredicted Sentiment: {predict_sentiment(r)}\n")
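
When reading these predictions, keep in mind that the vectorizer only knows words it saw during fitting. The sketch below, which refits on the 10 raw sample reviews without NLTK cleaning so it is self-contained, shows that a text made entirely of unseen words becomes an all-zero vector, so the predicted probability comes from the model's intercept alone.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Same sample reviews as in section 3.2 (1 = positive, 0 = negative)
texts = [
    "Amazing product, works great!", "Very disappointing, waste of money.",
    "I love it, highly recommended!", "Terrible experience, will not buy again.",
    "I am happy with the quality and service.", "Not worth buying, very poor performance.",
    "Excellent build and great performance.", "Really bad customer support.",
    "Good value for money.", "Worst product I have ever used.",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000).fit(vectorizer.fit_transform(texts), labels)

# Neither word occurs in the training reviews
unseen_vec = vectorizer.transform(["outstanding satisfied"])
print("Non-zero features:", unseen_vec.nnz)

# With no active features, the probability is sigmoid(intercept)
p_unseen = model.predict_proba(unseen_vec)[0, 1]
p_intercept = 1.0 / (1.0 + np.exp(-model.intercept_[0]))
print(f"P(positive) for unseen words: {p_unseen:.3f}")
print(f"sigmoid(intercept):           {p_intercept:.3f}")
```

Reporting `predict_proba` alongside the label makes such low-information predictions easier to spot.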

---
## 4. Summary of Deliverables

### Task 1 – Student Pass/Fail Prediction
- ✅ **Classification Model**: Logistic Regression trained on Study Hours and Attendance
- ✅ **Evaluation Metrics**: Accuracy, Confusion Matrix, Classification Report
- ✅ **Insights**: Relationship between study habits, attendance, and passing probability

### Task 2 – Sentiment Analysis with NLP
- ✅ **Preprocessed Dataset**: Cleaned text with stopwords removed
- ✅ **Text Vectorization**: TF‑IDF features
- ✅ **Classification Model**: Logistic Regression for sentiment classification
- ✅ **Evaluation Metrics**: Accuracy, Precision, Recall, F1‑Score
- ✅ **Insights**: Common patterns in positive and negative reviews + example predictions

This notebook is ready to be used as a professional project submission and can be directly uploaded to GitHub.