# CSCI 4455/5455 – Spring 2020 
### Assignment 1 - Know Your Data/Data Preprocessing
#### Your name: swayanshu shanti pragnya

In 2014, a new disease caused by the "T-Virus" spread across the world. A medical research center developed two different medicines, Anti-A and Anti-B. However, the researchers did not know how to treat patients with the two medicines, so they ran an experiment: for one month, they injected different combinations of the two medicines into 40 infected patients. After one month, the researchers measured the amount of T-Virus in the patients' blood. Assume all patients had 100% T-Virus before receiving the treatments.

In [1]:
import pandas as pd
import math
import warnings # for hiding warnings
warnings.filterwarnings('ignore')

In [2]:
base_path = r'C:\Users\Swayanshu' # path to the folder that contains the dataset.csv. leave it blank in your submission
dataset_name = 'dataset.csv'
dataset_path = rf'{base_path}/{dataset_name}'

Load the dataset and check the first 10 rows

In [3]:
# DO NOT EDIT THIS CELL
dataset = pd.read_csv(dataset_path)
dataset.head(10)

Unnamed: 0,anti_a,anti_b,t_virus
0,5.76,94.4,9.3
1,7.9,15.5,11.5
2,4.41,4.8,15.2
3,8.7,48.3,15.8
4,2.84,78.9,11.5
5,2.84,33.6,77.9
6,7.31,58.4,74.9
7,4.24,53.3,60.0
8,9.83,55.3,8.4
9,7.59,76.7,73.4


Get the summary statistics

In [4]:
# DO NOT EDIT THIS CELL
dataset.describe()

Unnamed: 0,anti_a,anti_b,t_virus
count,40.0,40.0,40.0
mean,5.11625,49.5775,37.465
std,3.157834,31.436003,30.809885
min,0.17,1.6,6.0
25%,2.5825,19.65,11.275
50%,5.415,50.45,16.15
75%,7.5225,80.025,69.275
max,9.94,98.8,90.7


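1. The mean of anti_b (49.58) is much higher than the mean of anti_a (5.12); the two columns are on different scales (roughly 0 to 100 versus 0 to 10).
2. t_virus is not normally distributed: its mean (37.47) is far above its median (16.15), which indicates a right-skewed distribution.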
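
A rough way to check a distribution's shape from the summary table alone is Pearson's second skewness coefficient, 3 × (mean − median) / std, which is close to 0 for a symmetric distribution. The sketch below plugs in the t_virus figures from the `describe()` output above; the variable names are illustrative only.

```python
# t_virus figures copied from the describe() output above
mean = 37.465
median = 16.15  # the 50% row
std = 30.809885

# Pearson's second skewness coefficient: about 0 for symmetric data,
# positive when the mean is pulled above the median (right skew)
skew = 3 * (mean - median) / std
print(f"approximate skewness: {skew:.2f}")
```

A value of about 2 points to a strong right skew, which is consistent with most patients having low T-Virus and a smaller cluster having high values.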

Check the number of records

In [5]:
# DO NOT EDIT THIS CELL
len(dataset)

40

#### Question 1: Equal Width Partitioning (15 Points)
Complete the function below.

parameters:
- values (not sorted) as a list
- the number of bins as an integer

returns a tuple containing:
- bin width => as a float
- bounds for the bins => as a nested list. Each item (bounds[i]) contains the lower bound and the upper bound for the bins[i]
- list of bins => as a nested list. bins[i] is a list of sorted values for the i-th bin



Equal-width binning: width = (max − min) / number of bins
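
As a worked instance of the formula, the sketch below computes the unrounded width and the resulting bounds. The helper name `equal_width_bounds` and the short sample list are assumptions made for illustration; the sample uses a few t_virus values that include the dataset's minimum (6.0) and maximum (90.7).

```python
def equal_width_bounds(values, bin_count):
    """Return (width, bounds) for equal-width bins without rounding the width."""
    min_v = min(values)
    width = (max(values) - min_v) / bin_count
    # bounds[i] = [lower, upper] of the i-th bin
    bounds = [[min_v + i * width, min_v + (i + 1) * width] for i in range(bin_count)]
    return width, bounds


sample = [6.0, 21.4, 55.5, 90.7]  # includes the t_virus min and max
width, bounds = equal_width_bounds(sample, 5)
print(round(width, 2))
for lower, upper in bounds:
    print(round(lower, 2), round(upper, 2))
```

With this unrounded width (about 16.94) the last upper bound coincides with the maximum. Rounding the width up, as the function in the next cell does, pushes the last upper bound slightly past the maximum.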

In [6]:
def equal_width_partitioning(values, bin_count):
    bins = [[] for _ in range(bin_count)]
    bounds = []
    min_v = min(values)
    # Bin width from the value range. It is rounded up to a whole number here,
    # so the last upper bound can exceed max(values).
    width = math.ceil((max(values) - min_v) / bin_count)
    lower = min_v
    for _ in range(bin_count):
        bounds.append([lower, lower + width])
        lower += width  # lower bound of the next bin
    for v in values:
        for j in range(bin_count):
            # The first bin is closed on both sides; the others are (lower, upper].
            if (j == 0 and v == bounds[j][0]) or bounds[j][0] < v <= bounds[j][1]:
                bins[j].append(v)
                break
    for i in range(bin_count):
        bins[i].sort()

    return (width, bounds, bins)

In [8]:
# DO NOT EDIT THIS CELL
t_virus = list(dataset['t_virus']) # get the t-virus values
bin_count = 5 # set the bin count to 5
(width, bounds, bins) = equal_width_partitioning(t_virus, bin_count) # call the binning function

print(f"partition width: {width}") #print the results
for i in range(len(bins)):
    s = "[" if i==0 else "("
    bin_values = ', '.join(str(x) for x in bins[i])
    print(f"{s}{bounds[i][0]}, {bounds[i][1]}] => {bin_values}") 
    

partition width: 17
[6.0, 23.0] => 6.0, 6.7, 7.5, 7.6, 8.4, 9.3, 9.4, 9.6, 10.6, 11.2, 11.3, 11.5, 11.5, 11.9, 13.0, 13.8, 14.1, 14.9, 15.2, 15.8, 16.5, 17.7, 21.4
(23.0, 40.0] => 
(40.0, 57.0] => 55.5
(57.0, 74.0] => 60.0, 61.4, 62.1, 64.9, 64.9, 69.0, 70.1, 73.4
(74.0, 91.0] => 74.9, 76.8, 77.0, 77.9, 80.1, 81.2, 83.8, 90.7


#### Question 2: Equal-depth (equal-frequency) Partitioning (15 Points)
Complete the function below.

parameters:
- values (not sorted) => as a list
- the number of values in each bin (not the number of bins). The last bin may be only partially filled if there are not enough elements.

returns a tuple containing:
- bounds for the bins => as a nested list. Each item (bounds[i]) contains the lower bound and the upper bound for the bins[i]
- list of bins => as a nested list. bins[i] is a list of sorted values for the i-th bin

Equal depth (or frequency) binning : In equal-frequency binning we divide the range [A, B] of the variable into intervals that contain (approximately) equal number of points; equal frequency may not be possible due to repeated values.
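
The idea can be sketched as sorting the values and slicing them into consecutive chunks. The helper name `chunk_sorted` is hypothetical; this is an illustration of the technique, not the required solution.

```python
def chunk_sorted(values, depth):
    """Split the sorted values into consecutive bins of at most `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


bins = chunk_sorted([11, 1, 11, 1, 10, 5, 6, 7, 8, 9], 4)
# Each bin's bounds are simply its smallest and largest members.
bounds = [[b[0], b[-1]] for b in bins]
print(bins)    # [[1, 1, 5, 6], [7, 8, 9, 10], [11, 11]]
print(bounds)  # [[1, 6], [7, 10], [11, 11]]
```

Ten values with a depth of 4 leave only two values for the last bin, which is the partially filled case mentioned in the parameter description.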

In [9]:
def equal_depth_partitioning(values, bin_frequency):
    bins = []
    bounds = []
    bin_count = math.ceil(len(values)/ bin_frequency)
    for i in range(bin_count):
        bins.append(list())
        bounds.append(list())
    #sort values
    values = sorted(values)
    start = 0
    end = bin_frequency
    for i in range(bin_count):
        bins[i] = values[start:end]
        start += bin_frequency
        end += bin_frequency
    for i in range(bin_count):
        bounds[i] = [bins[i][0],bins[i][-1]]
    return (bounds, bins)

In [10]:
equal_depth_partitioning([11,1,11,1,10,5,6,7,8,9],2)

([[1, 1], [5, 6], [7, 8], [9, 10], [11, 11]],
 [[1, 1], [5, 6], [7, 8], [9, 10], [11, 11]])

In [11]:
# DO NOT EDIT THIS CELL
t_virus = list(dataset['t_virus'])
bin_frequency = 5
(bounds, bins) = equal_depth_partitioning(t_virus, bin_frequency)

for i in range(len(bins)):
    bin_values = ', '.join(str(x) for x in bins[i])
    print(f"[{bounds[i][0]}, {bounds[i][1]}] => {bin_values}")

[6.0, 8.4] => 6.0, 6.7, 7.5, 7.6, 8.4
[9.3, 11.2] => 9.3, 9.4, 9.6, 10.6, 11.2
[11.3, 13.0] => 11.3, 11.5, 11.5, 11.9, 13.0
[13.8, 15.8] => 13.8, 14.1, 14.9, 15.2, 15.8
[16.5, 60.0] => 16.5, 17.7, 21.4, 55.5, 60.0
[61.4, 69.0] => 61.4, 62.1, 64.9, 64.9, 69.0
[70.1, 77.0] => 70.1, 73.4, 74.9, 76.8, 77.0
[77.9, 90.7] => 77.9, 80.1, 81.2, 83.8, 90.7


#### Question 3: Grouping (15 Points)
Now, divide the 40 patients into two groups such that one group has much lower T-Virus than the other. To find the cut-off value, find the largest gap between two consecutive values in sorted order. Complete the function below so that it returns the smallest value that should be assigned to the second group (you can use the results from the partitioning analysis above to verify your answer).

inputs:
- values (not sorted) as a list

returns:
- the cut-off value

In [12]:
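def find_first_value(values):
    # NOTE: as written, this returns the largest absolute difference between
    # neighbouring entries in the given (unsorted) order, not a data value.
    largest_gap = 0
    for i in range(len(values) - 1):
        gap = abs(values[i] - values[i + 1])
        if gap > largest_gap:
            largest_gap = gap
    return largest_gap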

In [13]:
# DO NOT EDIT THIS CELL
t_virus = list(dataset['t_virus'])
cut_off = find_first_value(t_virus)
print(f"first value of the second group should be: {cut_off}")
group1 = dataset[dataset['t_virus'] < cut_off]
group2 = dataset[dataset['t_virus'] >= cut_off]

first value of the second group should be: 76.3
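
The printed value 76.3 does not appear among the t_virus values listed in the partitioning output: the function returns the largest difference between neighbouring entries in the original row order, not a value from the data. A minimal sketch of the approach the question describes (sort, find the widest gap, return the value just above it) is shown below. The helper name `first_value_after_largest_gap` is hypothetical, and the list is copied from the equal-depth output above.

```python
def first_value_after_largest_gap(values):
    """Return the value just above the widest gap between sorted neighbours."""
    ordered = sorted(values)
    # index of the upper element of the widest gap
    upper = max(range(1, len(ordered)), key=lambda i: ordered[i] - ordered[i - 1])
    return ordered[upper]


t_virus_values = [
    6.0, 6.7, 7.5, 7.6, 8.4, 9.3, 9.4, 9.6, 10.6, 11.2,
    11.3, 11.5, 11.5, 11.9, 13.0, 13.8, 14.1, 14.9, 15.2, 15.8,
    16.5, 17.7, 21.4, 55.5, 60.0, 61.4, 62.1, 64.9, 64.9, 69.0,
    70.1, 73.4, 74.9, 76.8, 77.0, 77.9, 80.1, 81.2, 83.8, 90.7,
]
print(first_value_after_largest_gap(t_virus_values))  # 55.5
```

The widest gap lies between 21.4 and 55.5, which also matches the empty (23.0, 40.0] bin in the equal-width output. The group tables and statistics in the following cells were produced with the recorded cut-off of 76.3.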


#### Question 4: Measure of Central Tendency  (10 Points):
Complete the functions below to compute the mean and median of a given unsorted list.

input:
- values (not sorted) as a list

returns:
- the mean of the values as a float

In [14]:
def get_mean(values):
        
    return sum(values) / len(values)

input:
- values (not sorted) as a list

returns:
- the median of the values as a float

In [15]:
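def get_median(values):
    values = sorted(values)
    print(values)  # debug output: the sorted values
    mid = len(values) // 2
    # ~mid == -(mid + 1), so values[mid] and values[~mid] are the two middle
    # elements for an even length and the same element for an odd length
    return (values[mid] + values[~mid]) / 2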

In [16]:
# DO NOT EDIT THIS CELL
print(f"group 1\t\tmean: {get_mean(list(group1['t_virus']))} \tmedian: {get_median(list(group1['t_virus']))}")
print(f"group 2\t\tmean: {get_mean(list(group2['t_virus']))} \tmedian: {get_median(list(group2['t_virus']))}")

[6.0, 6.7, 7.5, 7.6, 8.4, 9.3, 9.4, 9.6, 10.6, 11.2, 11.3, 11.5, 11.5, 11.9, 13.0, 13.8, 14.1, 14.9, 15.2, 15.8, 16.5, 17.7, 21.4, 55.5, 60.0, 61.4, 62.1, 64.9, 64.9, 69.0, 70.1, 73.4, 74.9]
group 1		mean: 28.21515151515151 	median: 14.1
[76.8, 77.0, 77.9, 80.1, 81.2, 83.8, 90.7]
group 2		mean: 81.07142857142857 	median: 80.1
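
As a cross-check, the standard library's `statistics.median` can be compared against the `~mid` indexing idiom used in `get_median`. The sketch below uses the group 2 values printed above; the helper name `median_by_index` is hypothetical.

```python
import statistics


def median_by_index(values):
    """Median via the values[mid] / values[~mid] idiom."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    # ~mid equals -(mid + 1): the mirror position counted from the end
    return (ordered[mid] + ordered[~mid]) / 2


group2_values = [76.8, 77.0, 77.9, 80.1, 81.2, 83.8, 90.7]
print(median_by_index(group2_values))    # 80.1
print(statistics.median(group2_values))  # 80.1
print(median_by_index([1, 2, 3, 4]))     # 2.5
```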


#### Question 5: Data Normalization (15 Points)
Complete the following function that normalizes the values using the min-max normalization method:<br/>
input:
- values (not sorted) as a list

returns:
- min-max normalized values

In [17]:
def min_max(values):
    normalized_values = []
    max_v = max(values)
    min_v = min(values)
    for i in values:
        normalization = (i - min_v) / (max_v - min_v)
        normalized_values.append(normalization)
    return normalized_values
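
A small worked example of the formula (v − min) / (max − min) is sketched below. The function above divides by zero when all values are equal; the variant here returns 0.0 for every element in that case. That choice and the name `min_max_safe` are assumptions made for illustration, not part of the assignment.

```python
def min_max_safe(values):
    """Min-max normalize to [0, 1]; a constant list maps to all zeros (assumed convention)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # No spread to scale by, so avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_safe([2, 4, 10]))  # [0.0, 0.25, 1.0]
print(min_max_safe([5, 5]))      # [0.0, 0.0]
```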

In [18]:
# DO NOT EDIT THIS CELL
group1['anti_a'] = min_max(list(group1['anti_a']))
group1['anti_b'] = min_max(list(group1['anti_b']))

group2['anti_a'] = min_max(list(group2['anti_a']))
group2['anti_b'] = min_max(list(group2['anti_b']))

In [19]:
# DO NOT CHANGE THIS CELL
group1

Unnamed: 0,anti_a,anti_b,t_virus
0,0.57216,0.961658,9.3
1,0.791198,0.144041,11.5
2,0.433982,0.033161,15.2
3,0.873081,0.483938,15.8
4,0.273286,0.801036,11.5
6,0.730809,0.588601,74.9
7,0.416581,0.535751,60.0
8,0.988741,0.556477,8.4
9,0.759468,0.778238,73.4
10,0.047083,0.100518,61.4


In [20]:
# DO NOT CHANGE THIS CELL
group2

Unnamed: 0,anti_a,anti_b,t_virus
5,0.272956,0.272321,77.9
16,0.81761,0.867188,76.8
17,0.822642,0.879464,90.7
18,0.291824,0.127232,77.0
20,1.0,0.777902,81.2
29,0.851572,1.0,83.8
34,0.0,0.0,80.1


#### Question 6 (15 Points)
Complete the below function that computes the Pearson correlation coefficient between the normalized Anti-A and normalized Anti-B for each group
input: 
- X: values for anti-a
- Y: values for anti-b

returns:
- the Pearson correlation => as a float

In [25]:
def get_pearson(X, Y):
    X_mean = sum(X) / len(X)
    Y_mean = sum(Y) / len(Y)
    # deviations from the mean
    x = [var - X_mean for var in X]
    y = [var - Y_mean for var in Y]

    # numerator: sum of the products of paired deviations
    sum_xy = sum(i * j for i, j in zip(x, y))

    # denominator: product of the root sums of squared deviations
    summation_x_square = sum(i * i for i in x)
    summation_y_square = sum(j * j for j in y)
    denominator = math.sqrt(summation_x_square) * math.sqrt(summation_y_square)

    return sum_xy / denominator

In [26]:
# DO NOT EDIT THIS CELL
peasron1 = get_pearson(list(group1['anti_a']),list(group1['anti_b']))
peasron2 = get_pearson(list(group2['anti_a']),list(group2['anti_b']))

print(f"Pearson correlation\n\tGroup1: {peasron1},\n\tGroup2: {peasron2}")

Pearson correlation
	Group1: -0.15276464415341665,
	Group2: 0.9491728722979678


#### Question 7: Analytical Thinking (15 Points)
Based on the above analyses, conclude how to best use the two medicines to fight T-Virus. Your answer is expected to be analytical, not just numbers. You need to justify your answer.

Your answer:

Results from the experiments above:

Pearson correlation
	Group1: -0.15276464415341665,
	Group2: 0.9491728722979678

Q. How to best use the two medicines to fight T-Virus?

From the results for the two groups:
1. In group 2 (the patients with the most T-Virus remaining), the normalized Anti-A and Anti-B doses are strongly positively correlated (r ≈ 0.95). Patients in this group received the two medicines in similar relative amounts, either both high or both low.
2. In group 1, the doses are essentially uncorrelated (r ≈ -0.15), so the relative amounts of Anti-A and Anti-B varied independently of each other.
3. Taken together, this suggests that giving both medicines in matching proportions is the least effective combination, and that a treatment should rely mainly on one of the two medicines rather than balancing them.
4. The raw mean dose of Anti-A (5.12) is much lower than that of Anti-B (49.58). The two medicines are dosed on different scales, so this difference alone does not show which one is more potent.
5. If both medicines cost the same, Anti-A is a reasonable first choice because it is given in much smaller amounts than Anti-B. Even if Anti-A is more expensive, a smaller dose may be preferable because it exposes patients to less of the drug.

