Reg Num:  _______________________

---
$$\color{orange}{AML\ 5152\,\lvert\, Applied\ Machine\ Learning\,\lvert\,Lab\ Final\,\lvert\,Odd\ Semester\ 2023}$$
---

**Instructions:**
1. Fill the reg num at the top of this notebook
2. When a code template is provided, you have to fill the code template. You should not replace the code template with different code from elsewhere
3. Upload your Jupyter notebook with all its outputs intact here: https://tinyurl.com/tckf29w4
4. Do not solicit inputs from others. Plagiarism check will be performed after the exam
5. You will be asked orally to justify each design choice you make along the way. If you cannot defend a choice, some marks will be deducted

### **Problem Statement**

**PS: DO NOT FORGET TO TRAIN-TEST SPLIT AT AN APPROPRIATE TIME IN THE ENTIRE FLOW**.

You decide where you want to position the train test split in the stages below

#### I. Create a pandas dataset for apples satisfying the following constraints:
1. The dataset should have six columns (weight, volume, pesticide per apple, discoloration, machine_plucked and apple_type) and 100 records: 70 Gala apples, 15 Fuji apples and 15 Red apples
2. Apple type is a target variable. Remaining are predictor variables. 
3. There are 3 apple types: Gala, Fuji, Red. 
4. Machine Plucked is either yes or no.
5. Simulate the data randomly such that subsequent simulations produce very similar or identical data 
6. Simulate the data for the 3 types of apples according to the following rules:
	* Gala apple weights are normally distributed with mean of 155 g and a standard deviation of 5 g.
	* Fuji apple weights are uniformly distributed between 200 and 250 gram
	* Red apple weights are distributed according to triangular distribution with minimum weight of 100g, maximum weight of 190 g and the most frequent weight being 170g
7. Apple volumes are normally distributed for the three apple types Gala, Fuji and Red with means of 187cc, 270cc and 150cc respectively and a variance of 25 cc $^2$ each
8. A pesticide Quinalphos was dissolved in water and sprayed at the rate of 500 gm per 100 apples. The spray was unequal and had a variance of 4 milligram per apple. This pesticide dosage is common to all three apple types. This data will be used to populate "pesticide per apple" feature
9. Discoloration of the apple is equal to the percentile of the pesticide per apple

#### II. Introduce NaNs
1. Randomly introduce NaN for weight and volume feature for 25% of the records such that the **fraction of NaN for each apple type is proportional to the ratio of samples**.
2. Pesticide_per_apple data should be randomly nulled out for data beyond 75th percentile
3. Randomly introduce NaN for machine plucked apples for 5% of records
4. Randomly introduce NaN for apple type for 10% of records

#### III Transform, Train/Test Split and Impute
Ask yourself these questions and do accordingly: 
1. Will you do imputation before or after doing train test split? 
2. Will you do transformation after imputation or before? 
3. Will you do split before transformation?

According to your choice do these three in the order you deem fit

1. Impute the data for relevant columns using an appropriate imputation method fit for each scenario
2. If there are any records that you feel should be deleted, then please do so
3. Do a train test split 80:20 such that the fraction of NaN for each apple type is proportional to the ratio of samples of that apple type
4. Do any other data transformation you feel is needed

#### IV Feature Elimination and Feature Selection
1. If there are any features that you can immediately drop without any exploration or programming, then please do so first
2. Check which features have the highest predictive power with respect to the target variable
3. Check which features the target depends on. Use a mechanism that is different from the previous method for this.
4. Based on the above two checks, choose 2 features for predicting apple type

#### V ML Prediction
1. Apply Logistic Regression to predict apple type
2. Choose a metric that you think is most suitable for this scenario

In [3]:
from math import sqrt

import numpy as np
import pandas as pd

np.random.seed(42)
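
The seed is what makes constraint 5 (repeatable simulations) achievable. As a minimal sketch, assuming the legacy global `np.random` state used in this notebook, the snippet below shows that reseeding with the same value reproduces the same draws. The variable names are illustrative and not part of the template. Note that it resets the global seed, so rerun the seed cell afterwards if you execute it inside the notebook.

```python
import numpy as np

# Seed, draw, then reseed with the same value and draw again.
np.random.seed(42)
first_draw = np.random.normal(155, 5, size=5)

np.random.seed(42)
second_draw = np.random.normal(155, 5, size=5)

# Same seed and same sequence of calls give identical samples.
identical = np.array_equal(first_draw, second_draw)
print(identical)
```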

--- 

#### I. Create a pandas dataset for apples satisfying the following constraints:
1. The dataset should have six columns (weight, volume, pesticide per apple, discoloration, machine_plucked and apple_type) and 100 records: 70 Gala apples, 15 Fuji apples and 15 Red apples
2. Apple type is a target variable. Remaining are predictor variables. 
3. There are 3 apple types: Gala, Fuji, Red.

In [4]:
total_records = 100
apple_types = ["Gala", "Fuji", "Red"]
apple_ratios = (0.7, 0.15, 0.15)

4. Machine Plucked is either yes or no.

In [5]:
machine_plucked = np.random.choice(['Yes', 'No'], total_records, p=[0.5, 0.5])

**Simulate Gala apples**

1. Gala apple weights are normally distributed with mean of 155 g and a standard deviation of 5 g.
2. Gala apple volumes are normally distributed with mean of 187 cc and variance of 25 cc $^2$

**Note: Check if the np.random functions accept standard deviation or variance as arguments and accordingly adjust**  
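
As a quick check of the note above, here is a minimal sketch showing that the `scale` argument of a normal sampler is a standard deviation, so a stated variance has to be square-rooted first. The local generator `rng` and the large sample size are assumptions made for illustration only.

```python
from math import sqrt

import numpy as np

# Illustrative local generator; it leaves the notebook's global seed untouched.
rng = np.random.default_rng(0)

variance = 25              # cc^2, as stated for the apple volumes
std_dev = sqrt(variance)   # 5.0 cc, which is what `scale` expects

sample = rng.normal(loc=187, scale=std_dev, size=100_000)

# The empirical variance should sit near 25; passing 25 as `scale`
# would instead give a variance near 625.
sample_variance = sample.var()
print(std_dev, round(sample_variance, 1))
```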

In [6]:
from enum import Enum
class Gala(Enum):
    apple_ratio = 0.7
    
    volume_avg = 187
    volume_variance = 25

    weight_mean = 155
    weight_standard_dev = 5

In [7]:
gala_weights = np.random.?(Gala.weight_mean.value, 
                                Gala.?.?, 
                                int(Gala.?.? * total_records))  # size must be an int, not a float

gala_volumes = np.random.?(size=int(Gala.?.? * ?), 
                                scale=sqrt(Gala.?.?),
                                loc=Gala.?.?)

**Simulate Fuji apples**

1. Fuji apple weights are uniformly distributed between 200 and 250 gram
2. Fuji apple volumes are normally distributed with a mean of 270 cc and variance of 25 cc $^2$

In [8]:
fuji_dict = {
    "apple_ratio": 0.15,
    "volume_mean": 270,
    "volume_variance": 25,
    "weight_high": 250,
    "weight_low": 200
}

In [9]:
fuji_weights = np.random.?(size=int(fuji_dict[?] * ?),
                                 high=fuji_dict[?],
                                 low=fuji_dict[?])

fuji_volumes = np.random.?(size=int(fuji_dict[?] * ?), 
                                scale=sqrt(fuji_dict[?]),
                                loc=?["volume_mean"])

**Simulate Red Apples**


1. Red apple weights are distributed according to triangular distribution with minimum weight of 100g, maximum weight of 190 g and the most frequent weight being 170g
2. Red apple volumes are normally distributed with mean of 150cc and standard deviation of 5cc

In [10]:
red_weights = np.random.triangular(100, 170, 190, 15)
red_volumes = np.random.?(size=int(0.15*total_records), scale=5, loc=150)  # loc is the mean (150 cc), scale the standard deviation (5 cc)

8. A pesticide Quinalphos was dissolved in water and sprayed at the rate of 500 gm per 100 apples. The spray was unequal and had a variance of 4 milligram per apple. This pesticide dosage is common to all three apple types. This data will be used to populate "pesticide per apple" feature

In [12]:
#pesticide_per_apple = np.random.normal(50/total_records, 3, total_records) 

from scipy import stats
pesticide_per_apple_distribution = stats.norm(
    loc=500/total_records, 
    scale=?
)
pesticide_per_apple = pesticide_per_apple_distribution.rvs(size=total_records)
print(pesticide_per_apple)


[7.90228722 6.91854165 9.30636492 3.46530487 6.74464127 5.36668401
 9.37960587 3.38340343 3.32055632 3.80121471 0.75220855 3.94848996
 3.48173468 5.30078757 5.68351195 8.75234168 6.90084768 3.84619269
 3.20317066 5.98383834 2.35953359 8.66291753 7.35888024 4.0616487
 1.57373094 7.70774475 4.77092031 7.47563262 1.81114468 3.80124995
 5.0104874  5.09396119 4.09986906 6.24569986 2.86475914 4.71524103
 5.24059126 6.02887767 6.42322976 2.75071582 1.93177166 7.55535364
 5.66462802 3.50302693 8.10230395 5.23134927 7.35859437 5.13503696
 9.12149585 8.51068168 4.5020717  6.9431419  6.2907519  7.73726312
 3.07015308 6.37210292 7.11684897 1.48252103 2.63348297 0.92153564
 4.46118633 6.43508451 8.0047141  5.14818956 8.25723109 2.23979708
 1.59323512 4.8889046  5.7681309  4.9346105  0.8651158  4.82175992
 2.391061   6.3393451  5.73319649 3.12024043 3.97226617 2.88157296
 4.87464181 6.91028464 3.02854791 6.00809303 3.93948476 3.41425434
 4.78593928 2.92951536 3.89270139 2.60424421 8.92945027 5.07052

In [13]:
discoloration = [pesticide_per_apple_distribution.cdf(rec) for rec in pesticide_per_apple]
print(discoloration)

[0.926630062105628, 0.8312888358220992, 0.9843478290983501, 0.2214374482690769, 0.8084832587081733, 0.5727351614315366, 0.9857307340373052, 0.20945944468049693, 0.2005321821933025, 0.2744555400987607, 0.016839432006494345, 0.29952921459875403, 0.22388660138904837, 0.5597730284041454, 0.6337327284952862, 0.9696840875471561, 0.8290515319423867, 0.282002270471181, 0.18448225881795993, 0.6886117576099438, 0.09337858409518196, 0.966483951563132, 0.8808885164240854, 0.3194720437483427, 0.043343896480428155, 0.9121114507246151, 0.45440493113897107, 0.8921079106663193, 0.05542014078510172, 0.2744614145163289, 0.502091923936583, 0.5187356528099304, 0.32633161640210984, 0.7333084289311411, 0.1428458822254438, 0.4433901329369875, 0.547875517812965, 0.6965273928722624, 0.761648354314821, 0.1303703647849549, 0.06250078385638944, 0.8993182995033568, 0.6301739197437303, 0.22708337146196406, 0.9395673660474344, 0.5460447949114474, 0.8808600704160591, 0.5269155252202345, 0.9803364525519929, 0.960399487

In [14]:
# Convert the discoloration into percentage and only integer precision
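
One possible way to do this conversion is sketched below, using the first four fractions from the printed output above. Rounding to the nearest integer is an assumption. Truncating with `np.floor` would be an equally defensible choice, and the variable names are illustrative.

```python
import numpy as np

# First four discoloration fractions copied from the printed output.
fractions = np.array([
    0.926630062105628,
    0.8312888358220992,
    0.9843478290983501,
    0.2214374482690769,
])

# Scale to percent, round to the nearest whole number, store as integers.
discoloration_pct = np.round(fractions * 100).astype(int)
print(discoloration_pct)  # [93 83 98 22]
```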



**Concatenate all features and target variables and make the dataframe**

1. Be sure to line up the simulated records per the apple type and concatenate
2. Display the final dataframe 
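
The lining-up step can be sketched as follows. This is a toy example with hypothetical per-type counts and placeholder values, not the exam data. The point is that every feature is concatenated in the same apple-type order as the labels built with `np.repeat`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical small counts, chosen only to keep the example short.
counts = {"Gala": 7, "Fuji": 2, "Red": 1}

# Placeholder per-type arrays; the real ones come from the simulation cells.
weights = {name: rng.normal(150, 5, n) for name, n in counts.items()}
volumes = {name: rng.normal(200, 5, n) for name, n in counts.items()}

# Concatenate every feature in the same type order as the labels.
df = pd.DataFrame({
    "weight": np.concatenate([weights[name] for name in counts]),
    "volume": np.concatenate([volumes[name] for name in counts]),
    "apple_type": np.repeat(list(counts), list(counts.values())),
})
print(df)
```

Features that are common to all apple types, such as the pesticide values, can be added as full-length columns afterwards.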

---

#### II. Introduce NaNs
1. Randomly introduce NaN for weight and volume feature for 25% of the records such that the **fraction of NaN for each apple type is proportional to the ratio of samples**.
2. Pesticide_per_apple data should be randomly nulled out for data beyond 75th percentile. Use the discoloration data (which is nothing but the percentile) to make this determination
3. Randomly introduce NaN for machine plucked apples for 5% of records
4. Randomly introduce NaN for apple type for 10% of records
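
A minimal sketch of step 1 is given below, assuming a stand-in dataframe with the 70/15/15 class counts. Sampling 25% of the rows inside each apple type keeps the share of missing values per type in line with the class ratio. Nulling weight and volume on the same rows is an assumption, as is the rounding of the per-type counts, so the total is only approximately 25%.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Stand-in data with the exam's class counts; the feature values are placeholders.
df = pd.DataFrame({
    "weight": rng.normal(155, 5, 100),
    "volume": rng.normal(187, 5, 100),
    "apple_type": np.repeat(["Gala", "Fuji", "Red"], [70, 15, 15]),
})

nan_fraction = 0.25
for apple_type, group in df.groupby("apple_type"):
    # Number of rows to blank out within this apple type.
    n_missing = int(round(nan_fraction * len(group)))
    chosen = rng.choice(group.index.to_numpy(), size=n_missing, replace=False)
    df.loc[chosen, ["weight", "volume"]] = np.nan

# Count of missing weights per apple type.
missing_per_type = df["weight"].isna().groupby(df["apple_type"]).sum()
print(missing_per_type)
```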

---

#### III Transform, Train/Test Split and Impute
Ask yourself these questions and do accordingly: 
1. Will you do imputation before or after doing train test split? 
2. Will you do transformation after imputation or before? 
3. Will you do split before transformation?

According to your choice do these three in the order you deem fit

1. Impute the data for relevant columns using an appropriate imputation method fit for each scenario
2. If there are any records that you feel should be deleted, then please do so
3. Do a train test split 80:20 such that the fraction of NaN for each apple type is proportional to the ratio of samples of that apple type
4. Do any other data transformation you feel is needed
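
The sketch below shows one possible ordering, not the required one: split first with stratification on apple type, then fit the imputer on the training portion only and reuse it on the test portion. It assumes scikit-learn is available. The stand-in data, the median strategy, and the variable names are illustrative choices you would still need to justify for your own data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Stand-in data with the exam's class counts and some missing weights.
df = pd.DataFrame({
    "weight": rng.normal(155, 5, 100),
    "volume": rng.normal(187, 5, 100),
    "apple_type": np.repeat(["Gala", "Fuji", "Red"], [70, 15, 15]),
})
df.loc[rng.choice(100, size=25, replace=False), "weight"] = np.nan

features = df[["weight", "volume"]]
target = df["apple_type"]

# 80:20 split that preserves the 70/15/15 class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, stratify=target, random_state=42
)

# Learn the fill value from the training rows only, then apply it to both.
imputer = SimpleImputer(strategy="median")
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
print(y_test.value_counts())
```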

---

#### IV Feature Elimination and Feature Selection
1. If there are any features that you can immediately drop without any exploration or programming, then please do so first
2. Check which features have the highest predictive power with respect to the target variable
3. Check which features the target depends on. Use a mechanism that is different from the previous method for this.
4. Based on the above two checks, choose 2 features for predicting apple type
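
Two different scoring mechanisms that could serve for steps 2 and 3 are mutual information and the ANOVA F-test, both available in scikit-learn. The sketch below is an illustration on synthetic data, assuming one feature built from the stated volume means and one pure-noise feature. It is not a statement of which features to pick in the exam.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(42)

labels = np.repeat(["Gala", "Fuji", "Red"], [70, 15, 15])
volume_means = {"Gala": 187, "Fuji": 270, "Red": 150}

# One feature that depends on the class and one that does not.
X = pd.DataFrame({
    "informative": [rng.normal(volume_means[label], 5) for label in labels],
    "noise": rng.normal(0, 1, size=len(labels)),
})

# Mechanism 1: mutual information between each feature and the target.
mi_scores = mutual_info_classif(X, labels, random_state=42)

# Mechanism 2: ANOVA F-test of each feature against the target classes.
f_stats, p_values = f_classif(X, labels)

summary = pd.DataFrame(
    {"mutual_info": mi_scores, "f_stat": f_stats, "p_value": p_values},
    index=X.columns,
)
print(summary)
```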


---

#### V ML Prediction
1. Apply Logistic Regression to predict apple type
2. Choose a metric that you think is most suitable for this scenario
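
A minimal end-to-end sketch of this stage is shown below, assuming scikit-learn and synthetic stand-in features. Because the classes are imbalanced (70/15/15), it reports the macro-averaged F1 score as one candidate metric. That choice, the scaling step, and the feature construction are assumptions for illustration, and you should be ready to defend whichever metric you actually use.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

labels = np.repeat(["Gala", "Fuji", "Red"], [70, 15, 15])
volume_means = {"Gala": 187, "Fuji": 270, "Red": 150}

# Two stand-in features: volume (class dependent) and an uninformative one.
volume = np.array([rng.normal(volume_means[label], 5) for label in labels])
noise = rng.normal(0, 1, size=len(labels))
X = np.column_stack([volume, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42
)

# Scale inside the pipeline so the scaler is fitted on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Macro F1 weights each apple type equally, regardless of its frequency.
macro_f1 = f1_score(y_test, predictions, average="macro")
print(round(macro_f1, 3))
```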