**Assignment 2**

**Question 1**

1. **Perceptron**:
   - **Speed**: Generally fast, especially in its basic form.
   - **Strength**: Good for simple, linearly separable problems.
   - **Robustness**: Not robust to data that is not linearly separable, in which case the basic algorithm never converges; also sensitive to noisy data and outliers.
   - **Feature Type**: Works best with numerical features as it relies on linear combinations of features.
   - **Statistical Nature**: It's more geometric in nature, as it tries to find a separating hyperplane.
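   - **Optimization Problem**: Yes, it minimizes misclassifications. The cost function (the perceptron criterion) sums the unnormalized distances of misclassified points from the decision boundary, so correctly classified points contribute nothing and the weights change only when a mistake is made.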

2. **Support Vector Machine (SVM)**:
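   - **Speed**: Can be slow, especially for large datasets, because training solves a quadratic optimization problem; for kernel SVMs the training time grows roughly quadratically or worse with the number of samples.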
   - **Strength**: Very effective for high-dimensional spaces and when there is a clear margin of separation.
   - **Robustness**: More robust than Perceptron, especially with kernel SVM that can handle non-linear data.
   - **Feature Type**: Like Perceptron, favors numerical features. Can handle non-linear relationships with kernel trick.
   - **Statistical Nature**: It's a margin-based classifier and can be viewed both geometrically and statistically.
   - **Optimization Problem**: Yes, it maximizes the margin between classes. The cost function involves maximizing this margin while penalizing misclassifications.

3. **Decision Tree**:
   - **Speed**: Fast to train. Prediction is also fast: it takes one comparison per level of the tree, so its cost grows with tree depth.
   - **Strength**: Good for data with complex relationships and interactions between features.
   - **Robustness**: Can overfit, especially with noisy data and without proper pruning.
   - **Feature Type**: Can handle both numerical and categorical data effectively.
   - **Statistical Nature**: More heuristic-based, using algorithms like ID3, C4.5, etc., to build the tree.
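   - **Optimization Problem**: Not in the traditional sense. Finding the globally optimal tree is computationally intractable in general, so trees are grown greedily: at each node the algorithm picks the split that most improves a criterion such as information gain or Gini impurity.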

4. **Random Forest**:
   - **Speed**: Slower for training due to building multiple trees, but parallelizable. Fast for predictions.
   - **Strength**: Very powerful, as it combines multiple decision trees to improve performance.
   - **Robustness**: More robust than a single decision tree, as it reduces overfitting through ensemble learning.
   - **Feature Type**: Handles both numerical and categorical data well.
   - **Statistical Nature**: Statistical, as it is an ensemble method that averages many trees trained on bootstrap samples; by the law of large numbers, its error stabilizes as more trees are added.
   - **Optimization Problem**: Like decision trees, it's more heuristic-based, using the combined decisions of multiple trees.
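
To make the perceptron's "Optimization Problem" entry above more concrete, here is a minimal from-scratch sketch of its mistake-driven update. The toy data (logical AND), the learning rate, and the epoch cap are illustrative assumptions, not part of the assignment.

```python
# Minimal perceptron sketch; data, learning rate, and epoch cap are assumptions.

def predict(weights, bias, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else -1


def train_perceptron(samples, labels, learning_rate=1.0, max_epochs=100):
    """Mistake-driven perceptron training; labels must be +1 or -1."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            if predict(weights, bias, x) != y:
                # Shift the hyperplane toward the misclassified point.
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
                mistakes += 1
        if mistakes == 0:  # a full pass with no mistakes: converged
            break
    return weights, bias


# Toy, linearly separable data: logical AND with labels in {-1, +1}.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, -1, -1, 1]

weights, bias = train_perceptron(X, y)
predictions = [predict(weights, bias, x) for x in X]
print("all training points classified correctly:", predictions == y)
```

Only misclassified points trigger an update, which is why the method stops changing once the data are separated and why it keeps oscillating when no separating hyperplane exists.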

**Which to try first on your dataset?**
- This depends on the nature of the dataset. If the data is high-dimensional and linearly separable, SVM might be a good start. For more complex relationships or a mix of feature types, a Random Forest could be more suitable. If we are dealing with a large dataset and require fast training, starting with a simpler model like Perceptron or Decision Tree might be advantageous. Usually, starting with simpler models and then moving to more complex ones like Random Forest is a pragmatic approach.

**Question 2**

1. **Numerical**:
   - **Definition**: Features that are measured on a numeric scale. They can be either discrete (countable items) or continuous (measurable quantities).
   - **Example**: From the Iris dataset, the feature 'Petal Length' is a numerical feature. It is measured in centimeters and is a continuous variable. Example values might be 1.4, 4.5, 5.1 cm, etc.

2. **Nominal**:
   - **Definition**: Features that represent categories without any intrinsic ordering. They are one kind of categorical variable; ordinal variables, whose categories do have an order, are the other.
   - **Example**: In a dataset about cars, the feature 'Car Brand' could be nominal. Example values might be 'Toyota', 'Ford', 'Honda', etc.

3. **Date**:
   - **Definition**: Features that represent dates or times.
   - **Example**: In a sales dataset, the feature 'Transaction Date' would be a date feature. Example values might be '2023-01-15', '2023-02-20', etc.

4. **Text**:
   - **Definition**: Features that contain textual data. This type of data is unstructured and often requires special processing for analysis.
   - **Example**: In a dataset of customer reviews, the feature 'Review Text' is a text feature. Example values might be "Great product, highly recommend!" or "Poor quality, arrived late."

5. **Image**:
   - **Definition**: Features that are composed of image data, typically stored in formats like JPEG, PNG, etc.
   - **Example**: In a dataset for a facial recognition system, the feature might be 'Face Image'. The values would be the actual images of individuals' faces.

6. **Dependent Variable**:
   - **Definition**: The variable that you are trying to predict or explain in your dataset. It's the outcome or the variable whose variation you're interested in.
   - **Example**: In the Iris dataset, the dependent variable could be 'Species' which is the type of iris plant (Setosa, Versicolour, Virginica). This is what you're trying to predict based on other features like petal length, petal width, etc.
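
As a rough illustration of why the feature type matters, the sketch below shows one common way each type might be turned into numbers before modeling. The record, the category list, and the tiny 2x2 "image" are made-up values for demonstration only; real pipelines usually rely on library encoders.

```python
from collections import Counter
from datetime import date

# One hypothetical record mixing several feature types (illustrative values).
record = {
    "petal_length_cm": 1.4,                              # numerical
    "car_brand": "Ford",                                 # nominal
    "transaction_date": "2023-01-15",                    # date
    "review_text": "Great product, highly recommend!",   # text
    "face_image": [[0, 255], [128, 64]],                 # image (2x2 grayscale)
}

# Numerical: usable as-is.
numeric_value = record["petal_length_cm"]

# Nominal: one-hot encode, because the categories have no order.
brands = ["Toyota", "Ford", "Honda"]  # assumed list of known categories
brand_one_hot = [1 if record["car_brand"] == b else 0 for b in brands]

# Date: parse the ISO string and derive numeric parts.
parsed = date.fromisoformat(record["transaction_date"])
date_features = {
    "year": parsed.year,
    "month": parsed.month,
    "weekday": parsed.weekday(),  # Monday is 0, Sunday is 6
}

# Text: lowercase, strip punctuation, and count words (bag of words).
tokens = [w.strip(".,!?").lower() for w in record["review_text"].split()]
word_counts = Counter(tokens)

# Image: flatten the pixel grid and scale intensities to [0, 1].
pixels = [p / 255 for row in record["face_image"] for p in row]

print(numeric_value, brand_one_hot, date_features, dict(word_counts), pixels)
```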

**Question 3**

Several other metrics are commonly used to evaluate classifier performance. Each of these metrics offers different insights into the strengths and weaknesses of a classification model.

1. **Precision**: This metric indicates the proportion of positive identifications that were actually correct. It is calculated as the number of true positives (TP) divided by the total number of positive predictions (TP + false positives (FP)). Precision is particularly important in scenarios where the cost of false positives is high.

2. **Recall (Sensitivity or True Positive Rate)**: Recall measures how many of the actual positive cases were correctly identified by the classifier. It is calculated as TP divided by the total number of actual positives (TP + false negatives (FN)). This metric is crucial in situations where missing a positive instance is costly, such as medical diagnoses.

3. **F1 Score**: The F1 score combines precision and recall into a single metric by taking their harmonic mean: F1 = 2 × (precision × recall) / (precision + recall). It is high only when both precision and recall are high, making it useful in scenarios where both false positives and false negatives are important.

4. **AUC-ROC (Area Under the Receiver Operating Characteristic Curve)**: This metric evaluates a binary classifier across all decision thresholds. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) as the threshold varies. The AUC summarizes the classifier's ability to distinguish between the classes: 0.5 corresponds to random guessing and 1.0 to perfect separation.

5. **Log Loss (Logistic Loss or Cross-Entropy Loss)**: This is used to assess classifiers that output probabilities. It penalizes each prediction according to the probability assigned to the true class, so confident wrong predictions are penalized most heavily. Lower values are better.
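
The metrics listed above can be sketched from scratch as follows. The labels, the predicted probabilities, and the 0.5 threshold are invented for illustration; the AUC is computed as the fraction of positive/negative pairs in which the positive receives the higher score, which is equivalent to the area under the ROC curve.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


def roc_auc(y_true, y_prob):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities of the positive class.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 threshold

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
prec, rec = precision(tp, fp), recall(tp, fn)
f1 = f1_score(prec, rec)
auc = roc_auc(y_true, y_prob)
ll = log_loss(y_true, y_prob)
print(f"precision={prec:.3f} recall={rec:.3f} f1={f1:.3f} auc={auc:.4f} log_loss={ll:.3f}")
```

Precision, recall, and F1 depend on the chosen threshold, while AUC and log loss are computed directly from the probabilities.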

Understanding these metrics helps in evaluating classifier performance more comprehensively, especially in scenarios where accuracy alone might be misleading, such as with imbalanced datasets. The choice of metric often depends on the specific needs and context of the classification task at hand. 
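
A small made-up example shows how accuracy can mislead on an imbalanced dataset. It assumes 95 negative and 5 positive cases and a trivial "classifier" that always predicts the negative class.

```python
# Hypothetical imbalanced dataset: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5

# A useless model that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
actual_positives = sum(y_true)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}")  # high, despite the model being useless
print(f"recall   = {recall:.2f}")    # no positive case is ever found
```

The accuracy looks strong only because the negative class dominates; recall exposes that every positive case is missed.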

In [1]:
# Question 4 

import pandas as pd

# Function to calculate the mean of a list
def mean(values):
    return sum(values) / len(values)

# Function to calculate covariance between two lists
def covariance(x, y):
    mean_x, mean_y = mean(x), mean(y)
    # Sample covariance (n - 1 in the denominator)
    covar = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (len(x) - 1)
    return covar

# Function to calculate the standard deviation of a list
def standard_deviation(values):
    mean_value = mean(values)
    return (sum((x - mean_value) ** 2 for x in values) / (len(values) - 1)) ** 0.5

# Function to calculate correlation between two lists
def correlation(x, y):
    stddev_x = standard_deviation(x)
    stddev_y = standard_deviation(y)
    return covariance(x, y) / (stddev_x * stddev_y)

# Load the dataset
df = pd.read_csv('datasets/Admission_Predict.csv')

# Exclude 'Serial No.' if present
if 'Serial No.' in df.columns:
    df = df.drop(columns=['Serial No.'])

# Initialize an empty float DataFrame for the correlation matrix (avoids object dtype)
correlation_matrix = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)

# Calculate correlations
for column1 in df.columns:
    for column2 in df.columns:
        x = df[column1].tolist()
        y = df[column2].tolist()
        correlation_matrix.at[column1, column2] = correlation(x, y)

correlation_matrix

| | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|
| GRE Score | 1.0 | 0.835977 | 0.668976 | 0.612831 | 0.557555 | 0.83306 | 0.580391 | 0.80261 |
| TOEFL Score | 0.835977 | 1.0 | 0.69559 | 0.657981 | 0.567721 | 0.828417 | 0.489858 | 0.791594 |
| University Rating | 0.668976 | 0.69559 | 1.0 | 0.734523 | 0.660123 | 0.746479 | 0.447783 | 0.71125 |
| SOP | 0.612831 | 0.657981 | 0.734523 | 1.0 | 0.729593 | 0.718144 | 0.444029 | 0.675732 |
| LOR | 0.557555 | 0.567721 | 0.660123 | 0.729593 | 1.0 | 0.670211 | 0.396859 | 0.669889 |
| CGPA | 0.83306 | 0.828417 | 0.746479 | 0.718144 | 0.670211 | 1.0 | 0.521654 | 0.873289 |
| Research | 0.580391 | 0.489858 | 0.447783 | 0.444029 | 0.396859 | 0.521654 | 1.0 | 0.553202 |
| Chance of Admit | 0.80261 | 0.791594 | 0.71125 | 0.675732 | 0.669889 | 0.873289 | 0.553202 | 1.0 |


In [2]:
# Now let's verify the correctness of the from-scratch correlation matrix using DataFrame.corr() method
verification_corr_matrix = df.corr()

verification_corr_matrix

| | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|
| GRE Score | 1.0 | 0.835977 | 0.668976 | 0.612831 | 0.557555 | 0.83306 | 0.580391 | 0.80261 |
| TOEFL Score | 0.835977 | 1.0 | 0.69559 | 0.657981 | 0.567721 | 0.828417 | 0.489858 | 0.791594 |
| University Rating | 0.668976 | 0.69559 | 1.0 | 0.734523 | 0.660123 | 0.746479 | 0.447783 | 0.71125 |
| SOP | 0.612831 | 0.657981 | 0.734523 | 1.0 | 0.729593 | 0.718144 | 0.444029 | 0.675732 |
| LOR | 0.557555 | 0.567721 | 0.660123 | 0.729593 | 1.0 | 0.670211 | 0.396859 | 0.669889 |
| CGPA | 0.83306 | 0.828417 | 0.746479 | 0.718144 | 0.670211 | 1.0 | 0.521654 | 0.873289 |
| Research | 0.580391 | 0.489858 | 0.447783 | 0.444029 | 0.396859 | 0.521654 | 1.0 | 0.553202 |
| Chance of Admit | 0.80261 | 0.791594 | 0.71125 | 0.675732 | 0.669889 | 0.873289 | 0.553202 | 1.0 |


**Question 4**

1. **Should we use 'Serial No.'?**
   - No, 'Serial No.' should not be used. It is typically an arbitrary identifier assigned to each record and does not have any predictive or correlational relationship with the outcome variable. Including it could distort statistical analyses and predictive modeling because it does not contain meaningful information that contributes to the 'Chance of Admit'.

2. **Why does the diagonal of the matrix have all 1's?**
   - The diagonal of a correlation matrix represents the correlation of each variable with itself. Since any variable will always have a perfect linear relationship with itself, the correlation coefficient is 1.

3. **Correlations between all the variables:**
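   - Every pairwise correlation in the matrix is positive, ranging from 0.396859 ('LOR' and 'Research') to 0.873289 ('CGPA' and 'Chance of Admit'), so all the variables tend to rise together. The academic scores are strongly inter-correlated: 'GRE Score', 'TOEFL Score', and 'CGPA' correlate with one another at about 0.83, which means they carry overlapping information. 'Research' is generally the most weakly correlated variable (0.40 to 0.58). For the target 'Chance of Admit', coefficients closer to 1 or -1 indicate a stronger linear relationship; positive values mean that the chance of admission tends to increase as the feature increases, while negative values would indicate the opposite.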

4. **Most important variable to predict 'Chance of Admit':**
   - Based on the correlation matrix, 'CGPA' has the highest correlation with 'Chance of Admit' (0.873289), so it is the strongest single linear predictor of the chance of admission and would likely be a significant predictor in a model. 'GRE Score' (0.80261) and 'TOEFL Score' (0.791594) are also highly correlated with the target and are therefore important as well. Correlation measures only the linear association of each feature taken on its own, so this ranking is a first indication rather than a complete measure of importance.
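
The ranking behind the last answer can be reproduced with a short sketch. The coefficients are copied by hand from the 'Chance of Admit' column of the matrix above; the dictionary name and the squared-correlation column are additions for illustration.

```python
# Correlations with 'Chance of Admit', copied from the matrix above.
corr_with_target = {
    "GRE Score": 0.80261,
    "TOEFL Score": 0.791594,
    "University Rating": 0.71125,
    "SOP": 0.675732,
    "LOR": 0.669889,
    "CGPA": 0.873289,
    "Research": 0.553202,
}

# Rank features by the strength (absolute value) of their linear relationship.
ranked = sorted(corr_with_target.items(), key=lambda item: abs(item[1]), reverse=True)

for name, r in ranked:
    # r squared is the share of variance a one-variable linear fit would explain.
    print(f"{name:<18} r = {r:.3f}   r^2 = {r * r:.3f}")
```

Sorting by absolute value keeps the ranking meaningful even if a dataset contains strong negative correlations, which this one does not.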