<center><img src="https://gitlab.com/accredian/insaid-data/-/raw/main/Logo-Accredian/Case-Study-Cropped.png" width= 30% /></center>

# <center><b>Model Evaluation Techniques Assignment (Problem)</b></center>

---
# **Table of Contents**
---

**1.** [**Introduction**](#Section1)<br>
**2.** [**Problem Statement**](#Section2)<br>
**3.** [**Installing & Importing Libraries**](#Section3)<br>
  - **3.1** [**Installing Libraries**](#Section31)
  - **3.2** [**Upgrading Libraries**](#Section32)
  - **3.3** [**Importing Libraries**](#Section33)

**4.** [**Data Acquisition & Description**](#Section4)<br>
  - **4.1** [**Data Description**](#Section41)
  - **4.2** [**Data Information**](#Section42)

**5.** [**Data Pre-processing**](#Section5)<br>
  - **5.1** [**Pre-Profiling Report**](#Section51)<br>

**6.** [**Exploratory Data Analysis**](#Section6)<br>
**7.** [**Data Post-Processing**](#Section7)<br>
**8.** [**Model Development & Evaluation**](#Section8)<br>
**9.** [**Conclusion**](#Section9)<br>

---
<a name = Section1></a>
# **1. Introduction**
---

- **Evaluating** a machine learning model is as **important** as **building** it.

- We are creating models to perform on **new, previously unseen data**.

- Hence, a **thorough** and **versatile evaluation** is required to create a **robust** model.

- When it comes to **classification models**, the evaluation process gets somewhat tricky.

- The various evaluation metrics that will be used are:

  - **Accuracy:** It is a metric that calculates the number of correct predictions divided by the total number of predictions.

  - **Precision:** It is the fraction of positive predictions that are actually positive, i.e. how reliable the model is when it predicts the positive class.

  - **Recall:** It is the fraction of actual positives that the model correctly predicts as positive.

  - **F1-Score:** It is the harmonic mean of precision and recall.

  - **ROC Curve:** It plots the true positive rate against the false positive rate to summarize the performance of the model at different threshold values.

  - **Precision Recall Curve:** It shows the tradeoff between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision.
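
The threshold-free metrics above can be checked by hand on a small example. The sketch below uses made-up labels (`y_true` and `y_hat` are hypothetical, not taken from the assignment data), counts the confusion-matrix cells directly, and compares the result with the corresponding scikit-learn functions.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical toy labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Count the four confusion-matrix cells by hand
pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

accuracy = (tp + tn) / len(y_true)                  # 0.8
precision = tp / (tp + fp)                          # 0.75
recall = tp / (tp + fn)                             # 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75 (harmonic mean)

# The same values computed by scikit-learn
sk_scores = (
    accuracy_score(y_true, y_hat),
    precision_score(y_true, y_hat),
    recall_score(y_true, y_hat),
    f1_score(y_true, y_hat),
)
print(accuracy, precision, recall, f1)
print(sk_scores)
```

Because precision and recall happen to be equal in this toy case, their harmonic mean equals both of them; with unequal values the F1-score is pulled toward the smaller of the two.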




---
<a name = Section2></a>
# **2. Problem Statement**
---

- **Nephritis** is a condition in which the tissues in the **kidney** become **inflamed** and have problems filtering waste from the blood.

- Nephritis may be caused by **infection**, **inflammatory** conditions, certain **genetic conditions**, and other diseases or conditions.

- XYB Diagnostics is **renowned** for its **expertise** in diagnosing nephritis.

<center><img src="https://us.123rf.com/450wm/alkov/alkov1808/alkov180800012/112010269-illustration-of-the-accute-pyelonephritis-with-the-pus-inside-the-kidney-and-severe-inflammation-nor.jpg?ver=6" width=30%></center>

- They want to go a step ahead and **automate** the process of **detecting nephritis** depending on various criteria.

- For this automation, they have hired a data scientist. Let's say it is you.

- You are provided with **historical data of patients** who suffered from nephritis, along with some patients who showed symptoms similar to nephritis.

- Your task is to **create a model** based on this data so that it can be used in real time to **determine** if a patient is **suffering from nephritis**.

---
<a name = Section3></a>
# **3. Installing & Importing Libraries**
---

<a name = Section31></a>
### **3.1 Installing Libraries**

In [None]:
!pip install -q datascience                                         # Package that is required by pandas profiling
!pip install -q pandas-profiling                                    # Library to generate basic statistics about data
!pip install -q yellowbrick                                         # Toolbox for Measuring Machine Performance


<a name = Section32></a>
### **3.2 Upgrading Libraries**

- **After upgrading** the libraries, you need to **restart the runtime** so that the upgraded versions are loaded.

- After restarting the runtime, do not execute the install cell above (3.1) or the upgrade cell below (3.2) again.

In [None]:
!pip install -q --upgrade pandas-profiling
!pip install -q --upgrade yellowbrick

<a name = Section33></a>
### **3.3 Importing Libraries**

In [None]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing for panel data analysis
from pandas_profiling import ProfileReport                          # Import Pandas Profiling (To generate Univariate Analysis)
pd.set_option('display.max_columns', None)                          # Unfolding hidden features if the cardinality is high
pd.set_option('display.max_colwidth', None)                         # Unfolding the max feature width for better clarity
pd.set_option('display.max_rows', None)                             # Unfolding hidden data points if the cardinality is high
pd.set_option('mode.chained_assignment', None)                      # Removing restriction over chained assignments operations
pd.set_option('display.float_format', lambda x: '%.2f' % x)         # To suppress scientific notation over exponential values
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing package numpy (For Numerical Python)
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing pyplot interface using matplotlib
import seaborn as sns                                               # Importing seaborn library for statistical visualization
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.model_selection import train_test_split                # To split the data into train and test datasets
from sklearn.linear_model import LogisticRegression                 # To instantiate a Logistic Regression Model
from sklearn.metrics import accuracy_score                          # To calculate the accuracy of a classifier
from sklearn.metrics import precision_score                         # To calculate the precision of a classifier
from sklearn.metrics import recall_score                            # To calculate the recall of a classifier
from sklearn.metrics import f1_score                                # To calculate the f1-score of a classifier
from sklearn.metrics import precision_recall_curve                  # To plot the precision-recall curve of a classifier
#-------------------------------------------------------------------------------------------------------------------------------
import warnings                                                     # Importing warning to disable runtime warnings
warnings.filterwarnings("ignore")                                   # Suppress all warnings

---
<a name = Section4></a>
# **4. Data Acquisition & Description**
---

- The data was created by a **medical expert** as a data set to **help build a system** which will perform the presumptive **diagnosis of nephritis**.

<br>

| Records | Features | Dataset Size |
| :-- | :-- | :-- |
| 120 | 8 | 4 KB|

<br>

| Id | Features | Description |
| :-- | :-- | :-- |
| 01 | **temperature** | Temperature of patient |
| 02 | **nausea** | Occurrence of nausea |
| 03 | **lumbar_pain** | Occurrence of lumbar (lower back) pain |
| 04 | **urine_pushing** | Urine pushing (continuous need for urination) |
| 05 | **micturition_pain** | Pain while urinating |
| 06 | **burning** | Burning of urethra, itch, swelling of urethra outlet |
| 07 | **inflamation** | Inflammation of urinary bladder |
| 08 | **nephritis** | Nephritis of renal pelvis origin |

In [None]:
data = pd.read_csv(filepath_or_buffer='https://raw.githubusercontent.com/insaid2018/Term-2/master/Data/diagnosis.csv', delimiter='\t')
print('Data Shape:', data.shape)
data.head()

<a name = Section41></a>
### **4.1 Data Description**

- In this section we will get **summary statistics of the data** and note some observations.

In [None]:
data.describe()

<a name = Section42></a>
### **4.2 Data Information**

- In this section we will see the **information about the types of features**.

In [None]:
data.info()

<a name = Section5></a>

---
# **5. Data Pre-Processing**
---

<a name = Section51></a>
### **5.1 Pre Profiling Report**

- For **quick analysis**, pandas profiling is very handy.

- It generates profile reports from a pandas DataFrame.

- For each column, **statistics** are presented in an interactive HTML report.

In [None]:
profile = ProfileReport(df=data)
profile.to_file(output_file='Pre Profiling Report.html')
print('Accomplished!')

**Performing Operations**


---
**<h4>Question 1:** Create a function that performs the following cleaning operations on the dataset:</h4>

---

- Removes the whitespaces from column names.

- Removes the duplicate rows from the dataset.

- Maps 'yes' to 1 and 'no' to 0 for all categorical variables.

<details>

**<summary>Hint:</summary>**

- You can use `.str.replace(' ', '')` to remove whitespaces from the column names.

- You can use `.drop_duplicates` method to remove the duplicates.

- You can use `.map` method for the required changes.

</details>
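
The three operations in the hint can be tried out on a small frame first. The sketch below is illustrative only: the frame `toy` and the helper `clean_toy` are hypothetical names, the toy values are made up, and the helper returns a new DataFrame, whereas the cell that calls `clean_data(data=data)` below appears to expect the changes to be applied to `data` itself.

```python
import pandas as pd

# Hypothetical toy frame: padded column names, one duplicate row, yes/no values
toy = pd.DataFrame({
    ' temperature ': [36.6, 36.6, 40.1],
    'nausea ': ['no', 'no', 'yes'],
    ' nephritis': ['no', 'no', 'yes'],
})

def clean_toy(df):
    """Return a cleaned copy of df (illustrative helper, not the assignment solution)."""
    df = df.copy()
    # 1. Remove whitespace from the column names
    df.columns = df.columns.str.replace(' ', '')
    # 2. Remove duplicate rows
    df = df.drop_duplicates().reset_index(drop=True)
    # 3. Map 'yes' to 1 and 'no' to 0 in every column that holds only those two values
    for col in df.columns:
        if df[col].isin(['yes', 'no']).all():
            df[col] = df[col].map({'yes': 1, 'no': 0})
    return df

cleaned = clean_toy(toy)
print(cleaned)
```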

In [None]:
def clean_data(data=None):
  # Write your code here...

In [None]:
clean_data(data=data)
data.head()

<a name = Section52></a>
### **5.2 Post Profiling Report**

- Since we only mapped some of the features and removed duplicate rows from the dataset, we won't apply profiling to the dataset again.

<a name = Section6></a>

---
# **6. Exploratory Data Analysis**
---


---
**<h4>Question 2:** Create a function that checks how many patients who experienced a burning sensation were diagnosed with nephritis.</h4>

---

<details>

**<summary>Hint:</summary>**

- Create a 15x7 inch figure

- You can use `sns.countplot` method on burning feature and keep nephritis as hue.

- Add cosmetics like grid and title

- Keep the tick size as 12, label size as 14 and title size as 16.

</details>

In [None]:
def burning(data=None):
  # Write your code here...

In [None]:
burning(data=data)


---
**<h4>Question 3:** Create a function that checks the relation between nausea and body temperature.</h4>

---

<details>

**<summary>Hint:</summary>**

- Create a 15x7 inch figure

- You can use `sns.histplot` method on temperature feature keeping nausea as hue.

- Add cosmetics like grid and title

- Keep the tick size as 12, label size as 14 and title size as 16.

</details>

In [None]:
def nausea_n_temp(data=None):
  # Write your code here...

In [None]:
nausea_n_temp(data=data)


---
**<h4>Question 4:** Create a function that checks the relation between nephritis and body temperature.</h4>

---

<details>

**<summary>Hint:</summary>**

- Create a 15x7 inch figure

- You can use `sns.kdeplot` method on temperature feature keeping nephritis as hue.

- Add cosmetics like grid and title

- Keep the tick size as 12, label size as 14 and title size as 16.

</details>

In [None]:
def nephritis_n_temp(data=None):
  # Write your code here...

In [None]:
nephritis_n_temp(data=data)


---
**<h4>Question 5:** Create a function that checks the relation between nephritis, nausea and inflammation.</h4>

---

<details>

**<summary>Hint:</summary>**



- Create a 15x7 inch figure

- You can use `sns.countplot` method on `inflammation` feature keeping hue as `nephritis` and keep `data=data[data['nausea']==0]` or `data=data[data['nausea']==1]`.

- Add cosmetics like grid and title

- Keep the tick size as 12, label size as 14 and title size as 16.

</details>

In [None]:
def nausea_n_nephritis(data=None):
  # Write your code here...

In [None]:
nausea_n_nephritis(data=data[data['nausea']==0])

<a name = Section7></a>

---
# **7. Data Post-Processing**
---

<a name = Section71></a>
### **7.1 Feature Extraction**

- In this section, we will extract the important features and separate the independent and dependent variables.

---
**<h4>Question 6:** Create a function that creates two dataframes for dependent and independent features.</h4>

---

<details>

**<summary>Hint:</summary>**

- Create input dataframe X by dropping only "nephritis" feature from axis 1.

- Create the target series y by selecting the "nephritis" column.

</details>
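
As a minimal illustration of the hint, the sketch below splits a made-up frame into inputs and target. The names `toy`, `X_toy` and `y_toy` and the values are hypothetical; only the column name `nephritis` comes from the dataset description.

```python
import pandas as pd

# Hypothetical toy frame with the target column "nephritis"
toy = pd.DataFrame({
    'temperature': [36.6, 40.1, 37.2],
    'nausea': [0, 1, 0],
    'nephritis': [0, 1, 0],
})

# Inputs: every column except the target (same as toy.drop('nephritis', axis=1))
X_toy = toy.drop(columns='nephritis')

# Target: the "nephritis" column as a Series
y_toy = toy['nephritis']

print(X_toy.shape, y_toy.shape)
```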


In [None]:
def separate_Xy(data=None):
  # Write your code here...

In [None]:
X, y = separate_Xy(data=data)

<a name = Section72></a>
### **7.2 Data Preparation**

- Now we will **split** our **data** into **training** and **testing** parts for further development.

---
**<h4>Question 7:** Create a function that splits the data into train and test datasets while keeping random state as 42.</h4>

---

<details>

**<summary>Hint:</summary>**

- Use `train_test_split()` to split the dataset.

- Use `test_size` of **0.30**

- Use `random_state` equal to **42**.

- **Stratify** the target variable.

</details>
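
To see what stratifying does, the sketch below splits a synthetic, imbalanced target with the settings from the hint and checks that both parts keep the original class ratio. The arrays `X_toy` and `y_toy` are made up for illustration and are not the assignment data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 70 negatives and 30 positives
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.30,    # 30% of the rows go to the test set
    random_state=42,   # makes the shuffle reproducible
    stratify=y_toy,    # keep the class ratio the same in both parts
)

# Share of positives in each part; both match the 30% of the full data
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, a small dataset can easily end up with a test set whose class balance differs from the training set, which distorts precision and recall.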

In [None]:
def Xy_splitter(X=None, y=None):
  # Write your code here...

In [None]:
X_train, X_test, y_train, y_test = Xy_splitter(X=X, y=y)

<a name = Section8></a>

---
# **8. Model Development & Evaluation**
---

- In this section we will develop a Logistic Regression model and check its performance using different metrics.

<a name = Section81></a>
### **8.1 Model Development & Evaluation**

---
**<h4>Question 8:** Create a function that instantiates a logistic regression model, fits the model on train set, makes predictions on test set and returns those predictions.</h4>

---

<details>

**<summary>Hint:</summary>**

- Instantiate a logistic regression model using LogisticRegression().

- Use `class_weight = 'balanced'`.

- `Fit` the model on training set.

- `Predict` the labels on the train set and the test set, and get the predicted probabilities for both sets using `predict_proba`.

</details>
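
The fit and predict steps can be sketched on synthetic data before applying them to the real split. Everything below is an assumption made for illustration: the single feature, its values, the class sizes, and the names `X_demo`, `y_demo` and `clf`. `max_iter=1000` is added only to give the solver room on unscaled inputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical one-feature data: the positive class tends to have higher values
X_neg = rng.normal(loc=36.8, scale=0.4, size=(40, 1))
X_pos = rng.normal(loc=40.0, scale=0.4, size=(20, 1))
X_demo = np.vstack([X_neg, X_pos])
y_demo = np.array([0] * 40 + [1] * 20)

# class_weight='balanced' reweights classes inversely to their frequency
clf = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
clf.fit(X_demo, y_demo)

labels = clf.predict(X_demo)              # hard 0/1 predictions
proba = clf.predict_proba(X_demo)[:, 1]   # probability of the positive class

print(labels[:5], proba[:5])
```

The hard labels feed metrics such as accuracy and F1-score, while the positive-class probabilities are what a precision-recall curve needs.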

In [None]:
def train_n_eval():
  # Write your code here...

In [None]:
y_pred_train, y_pred, y_test_pred_proba, y_train_pred_proba = train_n_eval()


---
**<h4>Question 9:** Create a function that evaluates the model's training and testing predictions on the given metrics:</h4>

---

- Accuracy score
- Precision score
- Recall score
- F1 score



<details>

**<summary>Hint:</summary>**

- Evaluate the predictions using the `accuracy_score`, `precision_score`, `recall_score` and `f1_score` on the train set and the test set.

</details>

In [None]:
def calculate_metrics(y_pred=None, y_pred_train=None):
  # Write your code here...

In [None]:
calculate_metrics(y_pred=y_pred, y_pred_train=y_pred_train)


---
**<h4>Question 10:** Create a function that plots the Precision-Recall curve for the predictions on train and test data.</h4>

---

<details>

**<summary>Hint:</summary>**

- For plot_precision_recall_builder():

  - Calculate the precision and recall values at various thresholds using `precision_recall_curve()` method.

  - Calculate the average precision and recall values using `np.mean()`

  - Plot the curve using  `sns.lineplot()` and plot the average precision and recall values with respect to `[0, 1]`.

  - Add some cosmetics like title, grid and legend.

  - Keep label size as 14 and title size as 16.

- For plot_precision_recall():

  - Create 2 subplots and call the builder function separately for train predictions and test predictions.

  - Add some more cosmetics like super title.

</details>
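
The numeric part of the builder can be sketched without any plotting. The labels and scores below are made up, and `avg_precision` and `avg_recall` are hypothetical names for the plain means mentioned in the hint. Note that such a mean is not the same quantity as scikit-learn's `average_precision_score`.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical true labels and positive-class scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# One precision/recall pair per threshold, plus a final (precision=1, recall=0) point
precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Plain means of the curve values, as described in the hint
avg_precision = np.mean(precision)
avg_recall = np.mean(recall)

print(precision, recall, thresholds)
print(avg_precision, avg_recall)
```

The `precision` and `recall` arrays are the values that would be passed to `sns.lineplot()` as the y and x coordinates of the curve.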

In [None]:
def plot_precision_recall_builder(y_true, y_pred, train_or_test):
  '''
  y_true: Actual values of the target
  y_pred: Predicted scores of the target, from either predict_proba or decision_function
  train_or_test: Train Data or Test Data
  '''
  # Write your code here...

In [None]:
def plot_precision_recall():
  # Write your code here...

In [None]:
plot_precision_recall()

<a name = Section9></a>

---
# **9. Conclusion**
---

- We have seen that **inflammation** is **not relevant** to **nephritis** according to our data.

- Body **temperature** and **nausea** play an **important role** in determining if a patient suffers from **nephritis**.

- We have also **developed** a model and successfully **tested** it on various evaluation metrics.

- Based on a **patient's temperature** and **responses**, we can **predict** whether the patient suffers from nephritis.

- If we can get a **better snapshot of data** and more features that relate to nephritis, we can train a model ready for **real-world information**.