# Assignment  
## submitted by Varun Yadav - zvarun747@gmail.com


# Objective: 
### To showcase my data science, machine learning, teaching, and problem-solving skills.

## ----------------------------------------------------------------------------------------------------------------------------------

## Question 1

1. Data Science and Machine Learning
    - Develop a machine learning model in a Jupyter notebook using a dataset of your choice. This should include data preprocessing, model training, and model evaluation.


### Answer


To demonstrate the process of developing a machine learning model, let's work on a classic classification problem using the well-known Iris dataset. The task is to classify iris flowers into one of three species based on their sepal and petal measurements.

Let's start by importing the necessary libraries and loading the dataset.


In [97]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load the Iris dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
column_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
iris_data = pd.read_csv(url, names=column_names)

# Display the first few rows of the dataset
iris_data.head()



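|   | sepal_length | sepal_width | petal_length | petal_width | species     |
|---|--------------|-------------|--------------|-------------|-------------|
| 0 | 5.1          | 3.5         | 1.4          | 0.2         | Iris-setosa |
| 1 | 4.9          | 3.0         | 1.4          | 0.2         | Iris-setosa |
| 2 | 4.7          | 3.2         | 1.3          | 0.2         | Iris-setosa |
| 3 | 4.6          | 3.1         | 1.5          | 0.2         | Iris-setosa |
| 4 | 5.0          | 3.6         | 1.4          | 0.2         | Iris-setosa |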



The dataset is now loaded into the `iris_data` DataFrame, and we can observe its structure and content.

Next, let's preprocess the data by splitting it into features (X) and the target variable (y), and further splitting them into training and testing sets.


## Data preprocessing

In [98]:

# Split the dataset into features (X) and target variable (y)
X = iris_data.drop("species", axis=1)
y = iris_data["species"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
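
The split above is random and not stratified, so the three species need not appear in equal numbers in the test set. As an optional variation (not what this notebook uses), `train_test_split` accepts a `stratify` argument that preserves class proportions. The sketch below compares both on hypothetical stand-in data with 50 samples per class; the names `toy_X`, `toy_y`, `plain_test`, and `strat_test` are illustrative only.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 150 samples, 50 per class (same shape as Iris)
toy_X = [[i] for i in range(150)]
toy_y = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50

# Plain random split: class counts in the test set may be uneven
_, _, _, plain_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

# Stratified split: each class keeps its share of the test set
_, _, _, strat_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print("plain:     ", Counter(plain_test))
print("stratified:", Counter(strat_test))  # 10 samples of each class
```

With only 30 test samples, stratifying can make per-class metrics a little easier to compare, but either choice is reasonable for a dataset this balanced.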




Now that we have our data split, we can proceed with the model training. In this example, we'll use logistic regression as our machine learning algorithm.


# --------------------------------------------------------------------
## Let's understand more about Logistic Regression


# Logistic Regression

Logistic regression is a popular statistical algorithm for classification. Despite the name "regression", it is used primarily for classification tasks: unlike linear regression, which predicts continuous values, logistic regression predicts the probability that an instance belongs to a particular class. The description below covers the binary (two-class) case; scikit-learn's `LogisticRegression` also handles multi-class targets such as the three Iris species.

## Working Principle

The underlying principle of logistic regression is based on the logistic function, also known as the sigmoid function. The sigmoid function maps any real-valued number to a value strictly between 0 and 1. It is defined as `sigmoid(z) = 1 / (1 + e^(-z))`, and its S-shaped curve looks like this:

![Sigmoid Function](https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/400px-Logistic-curve.svg.png)

The logistic regression model estimates the parameters (weights and a bias) for which the predicted probabilities best match the labels observed in the training data. It applies a linear transformation to the features and passes the result through the sigmoid function to obtain the predicted probability.
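
To make the sigmoid's behaviour concrete, here is a minimal sketch using only the standard library. The function name `sigmoid` is chosen for illustration, and this naive form can overflow for very large negative inputs, so treat it as a teaching aid rather than production code.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real number z to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Large negative inputs approach 0, large positive inputs approach 1,
# and an input of 0 sits exactly in the middle.
for z in (-4.0, 0.0, 4.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.4f}")
```

Note the symmetry: `sigmoid(-z)` equals `1 - sigmoid(z)`, which is why the curve is centred on 0.5.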

## Model Representation

In logistic regression, the output of the model is the probability of an instance belonging to a specific class. The predicted probability is then converted into a binary prediction by applying a threshold (commonly 0.5). If the probability is above the threshold, the instance is classified as one class; otherwise, it is classified as the other class.

Mathematically, the logistic regression model can be represented as:

```
z = w_1*x_1 + w_2*x_2 + ... + w_n*x_n + b
p = sigmoid(z)
```

where:
- `z` is the linear combination of the feature values (`x_i`) weighted by the corresponding parameters (`w_i`) and added to the bias term (`b`).
- `p` is the predicted probability obtained by applying the sigmoid function to `z`.
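
The two-step representation above, followed by a threshold, can be sketched in a few lines. The weights, bias, and input values here are made up for illustration and do not come from the model fitted in this notebook; the helper names `predict_proba` and `predict_label` are likewise hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    """Compute p = sigmoid(w_1*x_1 + ... + w_n*x_n + b)."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return sigmoid(z)


def predict_label(x, w, b, threshold=0.5):
    """Turn the probability into a 0/1 prediction using a threshold."""
    return 1 if predict_proba(x, w, b) >= threshold else 0


# Hypothetical parameters and input (not taken from a fitted model)
w = [1.5, -2.0]
b = 0.25
x = [1.0, 0.5]

p = predict_proba(x, w, b)  # z = 1.5 - 1.0 + 0.25 = 0.75
print(f"p = {p:.3f}")
print("label at threshold 0.5:", predict_label(x, w, b))
print("label at threshold 0.7:", predict_label(x, w, b, threshold=0.7))
```

Raising the threshold makes the model more conservative about predicting the positive class, which is one way to trade recall for precision.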

## Model Training

The logistic regression model is trained using optimization algorithms like gradient descent to minimize a loss function. The most commonly used loss function in logistic regression is the binary cross-entropy loss. It measures the dissimilarity between the predicted probabilities and the true labels.

During training, the model adjusts the weights and biases to minimize the loss, aiming to make accurate predictions. The optimization process involves iteratively updating the parameters based on the gradients of the loss function with respect to the parameters.
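
The training loop described above can be sketched with plain batch gradient descent on the binary cross-entropy loss. This is a teaching sketch under simple assumptions (a fixed learning rate, no regularization, a tiny made-up dataset); it is not how scikit-learn's `LogisticRegression` is implemented, and names such as `fit_logistic`, `X_toy`, and `y_toy` are illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def binary_cross_entropy(y, p, eps=1e-12):
    """Mean binary cross-entropy between labels y and probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Fit weights and bias by batch gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    losses = []
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        losses.append(binary_cross_entropy(y, p))
        error = p - y                         # gradient of the loss w.r.t. z
        w -= lr * (X.T @ error) / n_samples   # gradient step for the weights
        b -= lr * error.mean()                # gradient step for the bias
    return w, b, losses


# Tiny made-up dataset: one feature, negatives on the left, positives on the right
X_toy = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

w_fit, b_fit, losses = fit_logistic(X_toy, y_toy)
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
print("predictions:", (sigmoid(X_toy @ w_fit + b_fit) >= 0.5).astype(int))
```

The first recorded loss equals `log(2)` because all-zero parameters predict a probability of 0.5 for every sample; the loss then falls as the weight grows.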

## Model Evaluation

To evaluate the performance of a logistic regression model, various evaluation metrics can be used. Common metrics include accuracy, precision, recall, F1-score, and area under the ROC curve (AUC-ROC). These metrics assess the model's ability to correctly classify instances and measure the trade-off between true positives and false positives.
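
Of the metrics listed above, AUC-ROC is the one not computed elsewhere in this notebook, so here is a small sketch on made-up binary labels and scores. It uses scikit-learn's `roc_auc_score` and checks it against a hand-written pairwise count; `pairwise_auc`, `toy_true`, and `toy_scores` are illustrative names.

```python
from sklearn.metrics import roc_auc_score


def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities for the positive class
toy_true = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

print("sklearn AUC: ", roc_auc_score(toy_true, toy_scores))  # 0.75
print("pairwise AUC:", pairwise_auc(toy_true, toy_scores))   # 0.75
```

Three of the four positive/negative pairs are ordered correctly (only 0.35 vs 0.4 is not), which gives 0.75.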

## Applications

Logistic regression is widely used in various domains, including:
- Disease diagnosis: Predicting the likelihood of a patient having a particular disease based on medical test results.
- Customer churn prediction: Determining the probability of customers leaving a service or subscription.
- Spam detection: Classifying emails as spam or not spam based on their content.
- Credit risk assessment: Evaluating the probability of default for loan applicants based on their financial information.

Logistic regression is a powerful and interpretable algorithm that forms the basis for more advanced techniques in machine learning. It provides a straightforward and effective approach for binary classification problems when the target is categorical and the log-odds of the positive class are approximately linear in the features.
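
Because the Iris task in this notebook has three classes rather than two, the binary sigmoid form needs a generalization. One common choice is the multinomial (softmax) formulation: one score per class, normalized into probabilities that sum to 1. The sketch below uses made-up weights and a made-up standardized input, and is not scikit-learn's implementation.

```python
import numpy as np


def softmax(z):
    """Turn a vector of class scores into probabilities that sum to 1."""
    shifted = z - np.max(z)  # subtract the max for numerical stability
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()


classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Hypothetical parameters: one row of weights and one bias per class
W = np.array([
    [0.2, 0.9, -1.1, -0.8],
    [0.1, -0.3, 0.2, -0.1],
    [-0.3, -0.6, 0.9, 0.9],
])
b = np.array([0.1, 0.4, -0.5])

# Hypothetical standardized measurements for a single flower
x = np.array([-0.9, 1.0, -1.3, -1.3])

scores = W @ x + b       # one linear score per class
probs = softmax(scores)  # class probabilities
predicted = classes[int(np.argmax(probs))]

print("probabilities:", np.round(probs, 3))
print("predicted class:", predicted)
```

With two classes, softmax reduces to the sigmoid form described above, so the binary explanation still carries over.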

## -------------------------------------------------------------------------------

### Scaling the features and model training

In [99]:

# Scale the features: fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create an instance of the logistic regression model
model = LogisticRegression()

# Train the model using the scaled training data
model.fit(X_train_scaled, y_train)
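
To see what the scaling step in the cell above does, here is a small NumPy sketch of standardization on made-up numbers. It mirrors the idea behind `StandardScaler` (subtract the training mean, divide by the training standard deviation) and, like the cell above, reuses the training statistics for the test data. The arrays `toy_train` and `toy_test` are hypothetical.

```python
import numpy as np

# Made-up data: 3 training rows and 1 test row, 2 features each
toy_train = np.array([[5.0, 1.0], [6.0, 2.0], [7.0, 3.0]])
toy_test = np.array([[6.0, 4.0]])

# Statistics are computed from the training data only
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)  # population standard deviation (ddof=0)

toy_train_scaled = (toy_train - mean) / std
toy_test_scaled = (toy_test - mean) / std  # reuse the training statistics

print("train mean after scaling:", toy_train_scaled.mean(axis=0))
print("train std after scaling: ", toy_train_scaled.std(axis=0))
print("scaled test row:         ", toy_test_scaled)
```

Fitting the scaler on the training set alone keeps information about the test set from leaking into the model.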



The model is trained using the scaled training data. Finally, let's evaluate the trained model using the testing data and calculate the accuracy score and classification report.


### Predicting the values


In [100]:

# Make predictions on the testing data
y_pred = model.predict(X_test_scaled)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 1.0



The code above prints the accuracy score. An accuracy of 1.0 means that all 30 test samples were classified correctly. The classification report, which gives a more detailed per-class evaluation, is generated further below.
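
Accuracy itself is simply the fraction of predictions that match the true labels. A minimal sketch on made-up labels follows; the names `toy_true`, `toy_pred`, and `accuracy` are illustrative and kept separate from the notebook's own variables.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up example with one mistake out of four
toy_true = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-versicolor"]
toy_pred = ["Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-versicolor"]

print(accuracy(toy_true, toy_pred))  # 3 of 4 correct -> 0.75
```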


## Classification report

In [101]:
y_pred


array(['Iris-versicolor', 'Iris-setosa', 'Iris-virginica',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-setosa',
       'Iris-versicolor', 'Iris-virginica', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-virginica', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica',
       'Iris-setosa', 'Iris-virginica', 'Iris-setosa', 'Iris-virginica',
       'Iris-virginica', 'Iris-virginica', 'Iris-virginica',
       'Iris-virginica', 'Iris-setosa', 'Iris-setosa'], dtype=object)

In [102]:
y_test

73     Iris-versicolor
18         Iris-setosa
118     Iris-virginica
78     Iris-versicolor
76     Iris-versicolor
31         Iris-setosa
64     Iris-versicolor
141     Iris-virginica
68     Iris-versicolor
82     Iris-versicolor
110     Iris-virginica
12         Iris-setosa
36         Iris-setosa
9          Iris-setosa
19         Iris-setosa
56     Iris-versicolor
104     Iris-virginica
69     Iris-versicolor
55     Iris-versicolor
132     Iris-virginica
29         Iris-setosa
127     Iris-virginica
26         Iris-setosa
128     Iris-virginica
131     Iris-virginica
145     Iris-virginica
108     Iris-virginica
143     Iris-virginica
45         Iris-setosa
30         Iris-setosa
Name: species, dtype: object

In [103]:
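# Convert the predictions to a plain Python list
y_pred = y_pred.tolist()


In [104]:
# Convert the true labels to a plain Python list as well
y_test = y_test.tolist()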



In [105]:
# Comparing the first 5 outputs with test

print(y_test[:5],'\n',y_pred[:5])

['Iris-versicolor', 'Iris-setosa', 'Iris-virginica', 'Iris-versicolor', 'Iris-versicolor'] 
 ['Iris-versicolor', 'Iris-setosa', 'Iris-virginica', 'Iris-versicolor', 'Iris-versicolor']


In [106]:

# A Classification report
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)

Classification Report:
                  precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        10
Iris-versicolor       1.00      1.00      1.00         9
 Iris-virginica       1.00      1.00      1.00        11

       accuracy                           1.00        30
      macro avg       1.00      1.00      1.00        30
   weighted avg       1.00      1.00      1.00        30



## Metrics used above

### Confusion matrix

```
|                  | Predicted Negative  | Predicted Positive  |
|------------------|---------------------|---------------------|
| Actual Negative  | True Negative (TN)  | False Positive (FP) |
| Actual Positive  | False Negative (FN) | True Positive (TP)  |
```

In the table, the rows represent the actual classes, while the columns represent the predicted classes. Each cell holds the number of instances that fall into that combination. The table shows the binary case; for a multi-class problem such as Iris, the matrix has one row and one column per class, and the per-class metrics below treat each class in turn as "positive" and all other classes as "negative".

- *True Negative* (TN): Negative instances that are correctly predicted as negative.
- *False Positive* (FP): Negative instances that are incorrectly predicted as positive.
- *False Negative* (FN): Positive instances that are incorrectly predicted as negative.
- *True Positive* (TP): Positive instances that are correctly predicted as positive.

The confusion matrix provides a comprehensive overview of the model's performance by showing the distribution of predicted and actual labels. It is a valuable tool for evaluating the accuracy and effectiveness of a classification model.
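
As a small illustration, scikit-learn's `confusion_matrix` produces the same layout as the table above (rows are actual classes, columns are predicted classes). The labels below are made up, with 0 as the negative class and 1 as the positive class; `toy_true` and `toy_pred` are illustrative names.

```python
from sklearn.metrics import confusion_matrix

# Made-up binary labels: 0 = negative, 1 = positive
toy_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=3, FP=1, FN=2, TP=4
```

Calling the same function on three-class labels, such as the Iris species, returns a 3x3 matrix instead.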



**Precision**:

Precision is calculated using the following formula:

```
Precision = True Positives / (True Positives + False Positives)
```

Precision measures the proportion of correctly predicted positive instances out of the total predicted positive instances.

**Recall**:

Recall is calculated using the following formula:

```
Recall = True Positives / (True Positives + False Negatives)
```

Recall measures the proportion of correctly predicted positive instances out of the total actual positive instances.

**F1-score**:

The F1-score is calculated using the following formula, which is the harmonic mean of precision and recall:

```
F1-score = 2 * (Precision * Recall) / (Precision + Recall)
```

The F1-score provides a balanced measure of a model's performance, taking into account both precision and recall.
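
The three formulas above can be checked by hand with a few lines of Python. The counts used here (4 true positives, 1 false positive, 2 false negatives) are made up for illustration, and the zero-division guards are one possible convention rather than the only one.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up counts for a single "positive" class
precision, recall, f1 = precision_recall_f1(tp=4, fp=1, fn=2)

print(f"precision = {precision:.3f}")  # 4 / 5 = 0.800
print(f"recall    = {recall:.3f}")     # 4 / 6 = 0.667
print(f"f1        = {f1:.3f}")         # 8 / 11 = 0.727
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, so a model cannot score well on F1 by maximizing only precision or only recall.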

**Support**:

Support is the number of actual instances of each class in the data being evaluated. In the report above, that is the 30-sample test set: 10 Iris-setosa, 9 Iris-versicolor, and 11 Iris-virginica.

These metrics are commonly used to evaluate the performance of classification models.

# ---------------   Thank You    ---------------------------