### Q1. A physician is evaluating a new diet for her patients with a family history of heart disease. To test the effectiveness of this diet, 16 patients are placed on the diet for 6 months. Their weights and triglyceride levels are measured before and after the study, and the physician wants to know if either set of measurements has changed. (Data set: dietstudy.csv)
 

In [1]:
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import scipy.stats as stats

%matplotlib inline

In [2]:
diet = pd.read_csv('dietstudy.csv')

diet.head()

Unnamed: 0,patid,age,gender,tg0,tg1,tg2,tg3,tg4,wgt0,wgt1,wgt2,wgt3,wgt4
0,1,45,Male,180,148,106,113,100,198,196,193,188,192
1,2,56,Male,139,94,119,75,92,237,233,232,228,225
2,3,50,Male,152,185,86,149,118,233,231,229,228,226
3,4,46,Female,112,145,136,149,82,179,181,177,174,172
4,5,64,Male,156,104,157,79,97,219,217,215,213,214


In [3]:
diet.shape

(16, 13)

## Two Sample T-Test (Paired)

In [4]:
print("The mean initial triglyceride level of patients was {}".format(diet.tg0.mean()))
print("The mean final triglyceride level of patients is {}".format(diet.tg4.mean()))
print("The mean initial weight of patients was {}".format(diet.wgt0.mean()))
print("The mean final weight of patients is {}".format(diet.wgt4.mean()))


The mean initial triglyceride level of patients was 138.4375
The mean final triglyceride level of patients is 124.375
The mean initial weight of patients was 198.375
The mean final weight of patients is 190.3125


#### Stage 1

#### H0: mean triglyceride level before the diet == mean triglyceride level after the diet

#### H1: mean triglyceride level before the diet != mean triglyceride level after the diet

In [5]:
#stats.ttest_rel(a =triglyceride levels before diet, b = triglyceride levels after diet)
stage1_tri = stats.ttest_rel(a=diet.tg0,
                b=diet.tg4) 

In [6]:
stage1_tri.pvalue

0.24874946576903698

In [7]:
stage1_tri.pvalue>0.05

True

#### Stage 2

#### H0: mean weight before the diet == mean weight after the diet

#### H1: mean weight before the diet != mean weight after the diet

In [8]:
#stats.ttest_rel(a = weights before diet, b = weights after diet)
stage2_wei = stats.ttest_rel(a=diet.wgt0,
                b=diet.wgt4) 

In [9]:
stage2_wei.pvalue

1.137689414996614e-08

In [10]:
stage2_wei.pvalue>0.05

False

### Conclusions

* 1. Since the significance value (p-value) for the change in weight is less than 0.05, we reject the null hypothesis: the average loss of 8.06 pounds per patient is unlikely to be due to chance variation and can be attributed to the diet.

* 2. However, the significance value for the change in triglyceride level is greater than 0.05, so we fail to reject the null hypothesis: the diet did not significantly reduce the patients' triglyceride levels.
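
A paired t-test is equivalent to a one-sample t-test on the per-patient differences, which is why pairing the before and after measurements matters here. The sketch below checks this using only the five patients displayed by `diet.head()` above, not the full 16-patient data set, so its numbers are illustrative and differ from the results reported in this notebook.

```python
import numpy as np
import scipy.stats as stats

# Triglyceride levels of the first five patients shown by diet.head();
# the full data set has 16 patients, so these results are illustrative only.
tg0 = np.array([180, 139, 152, 112, 156], dtype=float)  # before the diet
tg4 = np.array([100, 92, 118, 82, 97], dtype=float)     # after the diet

# Paired t-test, as used in the notebook
paired = stats.ttest_rel(a=tg0, b=tg4)

# Equivalent one-sample t-test on the differences against a mean of 0
diff = tg0 - tg4
one_sample = stats.ttest_1samp(a=diff, popmean=0)

# The same statistic computed by hand: mean difference / standard error
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))

print(paired.statistic, one_sample.statistic, t_manual)
```

All three printed values should agree up to floating-point error.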

### Q2. An analyst at a department store wants to evaluate a recent credit card promotion. To this end, 500 cardholders were randomly selected. Half received an ad promoting a reduced interest rate on purchases made over the next three months, and half received a standard seasonal ad. Is the promotion effective in increasing sales?

### Two Sample T-Test (Independent)

In [19]:
ana_cre = pd.read_csv('creditpromo.csv')

ana_cre.head()

Unnamed: 0,id,insert,dollars
0,148,Standard,2232.771979
1,572,New Promotion,1403.807542
2,973,Standard,2327.092181
3,1096,Standard,1280.030541
4,1541,New Promotion,1513.5632


In [31]:
half_standard = ana_cre['dollars'].loc[ana_cre['insert']=="Standard"]

half_promotion = ana_cre['dollars'].loc[ana_cre['insert']=="New Promotion"]


In [32]:
print("The average spent by customers who received the standard seasonal ad is ${}".format(half_standard.mean()))

print("The average spent by customers who received the new promotion ad is ${}".format(half_promotion.mean()))

The average spent by customers who received the standard seasonal ad is $1566.3890309659348
The average spent by customers who received the new promotion ad is $1637.4999830647992


* On average, customers who received the interest-rate promotion charged about $71 more than those who received the standard seasonal ad, and their spending varies a little more around its average.

In [33]:
eq_var = stats.ttest_ind(a= half_standard,
                b= half_promotion,
                equal_var=True)    # equal variance
eq_var.statistic

-2.2604227264649963

In [34]:
uneq_var = stats.ttest_ind(a= half_standard,
                b= half_promotion,
                equal_var=False)    # UnEqual variance
uneq_var.statistic

-2.260422726464996

In [35]:
# The two t statistics are practically identical, so we'll assume equal variance
uneq_var.statistic - eq_var.statistic

4.440892098500626e-16
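
The two statistics agree to within floating-point error because both groups have the same size (half of the 500 cardholders each, per the problem statement). With equal sample sizes the pooled and Welch standard errors coincide, so only the degrees of freedom, and hence the p-value, can differ. A minimal sketch on synthetic data (the samples and the seed are made up for illustration, not taken from creditpromo.csv) shows this, along with Levene's test as one way to check the equal-variance assumption directly:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

# Two synthetic groups of equal size but different spread (hypothetical data)
group_a = rng.normal(loc=1550, scale=300, size=250)
group_b = rng.normal(loc=1640, scale=400, size=250)

pooled = stats.ttest_ind(a=group_a, b=group_b, equal_var=True)
welch = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)

# Equal group sizes: the t statistics match even though the variances differ
print(pooled.statistic, welch.statistic)
# The p-values can still differ slightly, because Welch's test adjusts
# the degrees of freedom
print(pooled.pvalue, welch.pvalue)

# Levene's test: H0 is that both groups have equal variance
levene = stats.levene(group_a, group_b)
print(levene.pvalue)
```

If the group sizes were unequal, the two statistics would generally differ, and a small Levene p-value would favour `equal_var=False`.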

In [38]:
t = eq_var.statistic

p = eq_var.pvalue

print(" For the above test, the t-score is {} and the p-value is {}".format(t,p))

if(p<0.05):
    print('We reject null hypothesis')
else:
    print('We fail to reject null hypothesis')

 For the above test, the t-score is -2.2604227264649963 and the p-value is 0.024225996894147814
We reject null hypothesis


* Since the significance value of the test is less than 0.05, we can safely conclude that the average of 71.11 dollars more spent by cardholders receiving the reduced interest rate is not due to chance alone. The store will now consider extending the offer to all credit customers.

### Q3. An experiment is conducted to study the hybrid seed production of bottle gourd under open field conditions. The main aim of the investigation is to compare natural pollination and hand pollination. The data are collected on 10 randomly selected plants from each of natural pollination and hand pollination. The data are collected on fruit weight (kg), seed yield/plant (g) and seedling length (cm). (Data set: pollination.csv)
### a. Is the overall population mean of seed yield/plant (g) equal to 200?


### One Sample t-Test

H0 : population mean of seed yield/plant == 200

H1 : population mean of seed yield/plant != 200

In [39]:
poll = pd.read_csv('pollination.csv')

poll.head()

Unnamed: 0,Group,Fruit_Wt,Seed_Yield_Plant,Seedling_length
0,Natural,1.85,147.7,16.86
1,Natural,1.86,136.86,16.77
2,Natural,1.83,149.97,16.35
3,Natural,1.89,172.33,18.26
4,Natural,1.8,144.46,17.9


In [40]:
poll.Seed_Yield_Plant.mean()

180.8035

In [41]:
ttest_1 = stats.ttest_1samp(a=poll.Seed_Yield_Plant, popmean = 200)
t_score = round(ttest_1.statistic,2)
p_value = ttest_1.pvalue
print("The p value is {} and the T Score is {}".format(p_value,t_score))

The p value is 0.032891040921283025 and the T Score is -2.3


In [42]:
p_value>0.05

False

* Since the significance value of the test is less than 0.05, we reject the null hypothesis. Therefore, the overall population mean of seed yield/plant is not equal to 200.
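
A 95% confidence interval for the mean leads to the same decision as the two-sided test at the 0.05 level: the hypothesised value is rejected exactly when it falls outside the interval. The sketch below illustrates this using only the five plants displayed by `poll.head()` above (all from the Natural group), so its numbers are illustrative and differ from the full-sample result.

```python
import numpy as np
import scipy.stats as stats

# Seed yield/plant (g) of the five plants shown by poll.head();
# an illustrative subset only, not the full data set.
seed_yield = np.array([147.7, 136.86, 149.97, 172.33, 144.46])

n = len(seed_yield)
mean = seed_yield.mean()
sem = stats.sem(seed_yield)  # standard error of the mean (uses ddof=1)

# 95% confidence interval for the population mean, based on Student's t
ci_low, ci_high = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)

test = stats.ttest_1samp(a=seed_yield, popmean=200)

print("95% CI: ({:.2f}, {:.2f})".format(ci_low, ci_high))
print("200 inside CI:", ci_low <= 200 <= ci_high)
print("p-value below 0.05:", test.pvalue < 0.05)
```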

### b. Test whether the natural pollination and hand pollination under open field conditions are equally effective or are significantly different. 

### F-Test/Anova

H0  : for each of Fruit_Wt, Seed_Yield_Plant and Seedling_length, mean under natural pollination == mean under hand pollination

H1  : for the measurement being tested, mean under natural pollination != mean under hand pollination

In [94]:
natural_fruit_wt = poll['Fruit_Wt'].loc[poll['Group']=="Natural"]
hand_fruit_wt = poll['Fruit_Wt'].loc[poll['Group']=="Hand"]

# Perform the ANOVA
anova = stats.f_oneway(natural_fruit_wt,hand_fruit_wt)
# Statistic :  F Value
f1 = anova.statistic
p1 = anova.pvalue
print("The f-value is {} and the p value is {}".format(f1,p1))



The f-value is 312.228532974426 and the p value is 8.078362076486568e-13
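
With only two groups, one-way ANOVA is equivalent to the equal-variance two-sample t-test: the F statistic is the square of the t statistic, and the p-values coincide. A minimal sketch on made-up fruit weights (the values below are hypothetical, not taken from pollination.csv):

```python
import numpy as np
import scipy.stats as stats

# Hypothetical fruit weights (kg) for two pollination groups
natural = np.array([1.85, 1.92, 1.78, 1.95, 1.80, 1.88])
hand = np.array([1.98, 2.05, 1.90, 2.10, 1.86, 2.02])

anova = stats.f_oneway(natural, hand)
ttest = stats.ttest_ind(a=natural, b=hand, equal_var=True)

# For two groups, F equals t squared and both tests give the same p-value
print(anova.statistic, ttest.statistic ** 2)
print(anova.pvalue, ttest.pvalue)
```

ANOVA becomes necessary only when three or more groups are compared, as in the DVD player question later in this notebook.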


In [96]:
natural_seed_yield = poll['Seed_Yield_Plant'].loc[poll['Group']=="Natural"]
hand_seed_yield = poll['Seed_Yield_Plant'].loc[poll['Group']=="Hand"]

# Perform the ANOVA
anova = stats.f_oneway(natural_seed_yield,hand_seed_yield)
# Statistic :  F Value
f2 = anova.statistic
p2 = anova.pvalue
print("The f-value is {} and the p value is {}".format(f2,p2))

The f-value is 194.83303662980398 and the p value is 4.271481585484407e-11


In [97]:
natural_seedling_length = poll['Seedling_length'].loc[poll['Group']=="Natural"]
hand_seedling_length = poll['Seedling_length'].loc[poll['Group']=="Hand"]

# Perform the ANOVA
anova = stats.f_oneway(natural_seedling_length,hand_seedling_length)
# Statistic :  F Value
f3 = anova.statistic
p3 = anova.pvalue
print("The f-value is {} and the p value is {}".format(f3,p3))

The f-value is 6.46293337115627 and the p value is 0.020428817064110556

* Since the p-value is less than 0.05 in all three tests, we reject the null hypothesis for each measurement. Therefore natural pollination and hand pollination differ significantly in fruit weight, seed yield/plant and seedling length.


### Q4. An electronics firm is developing a new DVD player in response to customer requests. Using a prototype, the marketing team has collected focus data for different age groups viz. Under 25; 25-34; 35-44; 45-54; 55-64; 65 and above. Do you think that consumers of various ages rated the design differently? (Data set: dvdplayer.csv)

In [67]:
dvd = pd.read_csv('dvdplayer.csv')

dvd.head()

Unnamed: 0,agegroup,dvdscore
0,65 and over,38.454803
1,55-64,17.669677
2,65 and over,31.704307
3,65 and over,25.92446
4,Under 25,30.450007


### F-Test/Anova

In [70]:
age_group_1 = dvd['dvdscore'].loc[dvd['agegroup']=="65 and over"]
age_group_2 = dvd['dvdscore'].loc[dvd['agegroup']=="55-64"]
age_group_3 = dvd['dvdscore'].loc[dvd['agegroup']=="Under 25"]
age_group_4 = dvd['dvdscore'].loc[dvd['agegroup']=="35-44"]
age_group_5 = dvd['dvdscore'].loc[dvd['agegroup']=="45-54"]
age_group_6 = dvd['dvdscore'].loc[dvd['agegroup']=="25-34"]

# Perform the ANOVA
anova = stats.f_oneway(age_group_1,age_group_2,age_group_3,age_group_4,age_group_5,age_group_6)
# Statistic :  F Value
f = anova.statistic
p = anova.pvalue
print("The f-value is {} and the p value is {}".format(f,p))
if(p<0.05):
    print('We reject null hypothesis')
else:
    print('We fail to reject null hypothesis')

The f-value is 6.992526962676517 and the p value is 3.087324905679639e-05
We reject null hypothesis


* Since the significance value of the test is less than 0.05, we reject the null hypothesis. Therefore, consumers of various ages rated the design differently.
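
The six hand-written filters can also be built with `groupby`, which avoids typos in the group labels and adapts automatically if the set of age groups changes. A minimal sketch on a small made-up frame that reuses the notebook's column names (`agegroup`, `dvdscore`); the scores and the name `dvd_demo` are hypothetical:

```python
import pandas as pd
import scipy.stats as stats

# Hypothetical scores; only the column names match dvdplayer.csv
dvd_demo = pd.DataFrame({
    "agegroup": ["Under 25"] * 4 + ["25-34"] * 4 + ["35-44"] * 4,
    "dvdscore": [30.5, 28.1, 33.0, 29.4,
                 35.2, 37.8, 34.1, 36.5,
                 25.9, 27.3, 24.8, 26.6],
})

# One array of scores per age group, whatever groups are present
groups = [scores.to_numpy()
          for _, scores in dvd_demo.groupby("agegroup")["dvdscore"]]

# Unpack the list so each group is passed as a separate sample
anova_demo = stats.f_oneway(*groups)
print("The f-value is {} and the p value is {}".format(
    anova_demo.statistic, anova_demo.pvalue))
```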

### Q5. A survey was conducted among 2800 customers on several demographic characteristics. Working status, sex, age, age-group, race, happiness, no. of child, marital status, educational qualifications, income group etc. had been captured for that purpose. (Data set: sample_survey.csv).
### a. Is there any relationship between labour force status and marital status?

### Chi-Square Test

In [71]:
sam_sur = pd.read_csv('sample_survey.csv')

sam_sur.head()

Unnamed: 0,id,wrkstat,marital,childs,age,educ,paeduc,maeduc,speduc,degree,...,agecat,childcat,news1,news2,news3,news4,news5,car1,car2,car3
0,1,Working full time,Divorced,2.0,60.0,12.0,12.0,12.0,,High school,...,55 to 64,1-2,No,No,No,No,No,American,Japanese,Japanese
1,2,Working part-time,Never married,0.0,27.0,17.0,20.0,,,Junior college,...,25 to 34,,No,No,Yes,No,No,American,German,Japanese
2,3,Working full time,Married,2.0,36.0,12.0,12.0,12.0,16.0,High school,...,35 to 44,1-2,No,No,No,Yes,Yes,American,American,
3,4,Working full time,Never married,0.0,21.0,13.0,,12.0,,High school,...,Less than 25,,No,No,No,Yes,Yes,American,Other,
4,5,Working full time,Never married,0.0,35.0,16.0,,12.0,,Bachelor,...,35 to 44,,No,No,No,No,No,American,American,Korean


In [78]:
lab_mar_xtab = pd.crosstab(sam_sur.wrkstat, sam_sur.marital, margins = True)
lab_mar_xtab

wrkstat / marital,Divorced,Married,Never married,Separated,Widowed,All
Keeping house,25,200,35,13,55,328
Other,12,16,14,4,8,54
Retired,53,168,17,6,150,394
School,7,9,60,2,1,79
Temporarily not working,9,23,11,1,2,46
"Unemployed, laid off",10,13,32,0,3,58
Working full time,295,778,392,58,44,1567
Working part-time,35,138,102,9,20,304
All,446,1345,663,93,283,2830


In [79]:
x2test = stats.chi2_contingency(observed= lab_mar_xtab)

x2test

(729.2421426572284,
 1.820339965538765e-127,
 40,
 array([[5.16918728e+01, 1.55886926e+02, 7.68424028e+01, 1.07787986e+01,
         3.28000000e+01, 3.28000000e+02],
        [8.51024735e+00, 2.56643110e+01, 1.26508834e+01, 1.77455830e+00,
         5.40000000e+00, 5.40000000e+01],
        [6.20932862e+01, 1.87254417e+02, 9.23045936e+01, 1.29477032e+01,
         3.94000000e+01, 3.94000000e+02],
        [1.24501767e+01, 3.75459364e+01, 1.85077739e+01, 2.59611307e+00,
         7.90000000e+00, 7.90000000e+01],
        [7.24946996e+00, 2.18621908e+01, 1.07766784e+01, 1.51166078e+00,
         4.60000000e+00, 4.60000000e+01],
        [9.14063604e+00, 2.75653710e+01, 1.35879859e+01, 1.90600707e+00,
         5.80000000e+00, 5.80000000e+01],
        [2.46954770e+02, 7.44740283e+02, 3.67109894e+02, 5.14950530e+01,
         1.56700000e+02, 1.56700000e+03],
        [4.79095406e+01, 1.44480565e+02, 7.12197880e+01, 9.99010601e+00,
         3.04000000e+01, 3.04000000e+02],
        [4.46000000e+02, 1.345

In [80]:
print("The chi square stat is {} and the p value is {}".format(x2test[0],x2test[1]))

The chi square stat is 729.2421426572284 and the p value is 1.820339965538765e-127


* Since the significance value of the test is less than 0.05, we reject the null hypothesis. Therefore, there is a relationship between labour force status and marital status.
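
One caveat on the test above: `margins=True` adds the `All` row and column to the table that is passed to `chi2_contingency`. Those marginal cells contribute nothing to the statistic (their expected counts equal their observed counts), but they inflate the degrees of freedom from (8-1) x (5-1) = 28 to 40, so the reported p-value is not the one for the actual 8 x 5 table. The decision does not change here, because the p-value is far below 0.05 either way. A sketch that reruns the test on the observed counts copied from the crosstab above, without the margins:

```python
import numpy as np
import scipy.stats as stats

# Observed counts copied from the wrkstat x marital crosstab above,
# excluding the "All" row and column.
# Columns: Divorced, Married, Never married, Separated, Widowed
observed = np.array([
    [25, 200, 35, 13, 55],     # Keeping house
    [12, 16, 14, 4, 8],        # Other
    [53, 168, 17, 6, 150],     # Retired
    [7, 9, 60, 2, 1],          # School
    [9, 23, 11, 1, 2],         # Temporarily not working
    [10, 13, 32, 0, 3],        # Unemployed, laid off
    [295, 778, 392, 58, 44],   # Working full time
    [35, 138, 102, 9, 20],     # Working part-time
])

chi2, p, dof, expected = stats.chi2_contingency(observed)

print("chi-square = {}, dof = {}, p = {}".format(chi2, dof, p))
```

In the notebook itself, the equivalent change would be to build the table with `pd.crosstab(sam_sur.wrkstat, sam_sur.marital)` (margins are off by default) before passing it to `chi2_contingency`. The same consideration applies to the other crosstabs in this question.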

### b. Do you think educational qualification is somehow controlling the marital status?  

### Chi-Square Test

In [81]:
edu_mar_xtab = pd.crosstab(sam_sur.degree, sam_sur.marital, margins = True)
edu_mar_xtab

degree / marital,Divorced,Married,Never married,Separated,Widowed,All
Bachelor,58,251,129,12,28,478
Graduate,29,123,41,3,9,205
High school,241,686,367,58,148,1500
Junior college,45,108,46,3,6,208
LT High school,70,174,77,17,92,430
All,443,1342,660,93,283,2821


In [82]:
x2test_b = stats.chi2_contingency(observed= edu_mar_xtab)

x2test_b

(122.68449020508541,
 7.424404099753273e-15,
 25,
 array([[  75.06345268,  227.39312301,  111.83268345,   15.75824176,
           47.95249911,  478.        ],
        [  32.19248493,   97.52215526,   47.9617157 ,    6.75824176,
           20.56540234,  205.        ],
        [ 235.55476781,  713.57674583,  350.9393832 ,   49.45054945,
          150.4785537 , 1500.        ],
        [  32.66359447,   98.94930876,   48.66359447,    6.85714286,
           20.86635945,  208.        ],
        [  67.52570011,  204.55866714,  100.60262318,   14.17582418,
           43.1371854 ,  430.        ],
        [ 443.        , 1342.        ,  660.        ,   93.        ,
          283.        , 2821.        ]]))

In [83]:
print("The chi square stat is {} and the p value is {}".format(x2test_b[0],x2test_b[1]))

The chi square stat is 122.68449020508541 and the p value is 7.424404099753273e-15


* Since the significance value of the test is less than 0.05, we reject the null hypothesis. Therefore, marital status is not independent of educational qualification: the two are significantly associated.

### c. Is happiness driven by earnings or marital status?

### Chi-Square Test

In [84]:
hap_ear_xtab = pd.crosstab(sam_sur.happy, sam_sur.income, margins = True)
hap_ear_xtab

happy / income,$1000 TO 2999,$10000 - 14999,$15000 - 19999,$20000 - 24999,$25000 or more,$3000 TO 3999,$4000 TO 4999,$5000 TO 5999,$6000 TO 6999,$7000 TO 7999,$8000 TO 9999,LT $1000,All
Not too happy,7,39,33,40,113,9,9,6,14,12,9,11,302
Pretty happy,20,107,119,155,888,11,13,18,13,21,30,13,1408
Very happy,5,44,26,50,571,4,10,11,6,14,19,11,771
All,32,190,178,245,1572,24,32,35,33,47,58,35,2481


In [85]:
x2test_ear = stats.chi2_contingency(observed= hap_ear_xtab)

x2test_ear

(178.95053061216427,
 7.234749067043371e-21,
 36,
 array([[   3.89520355,   23.12777106,   21.66706973,   29.82265216,
          191.35187424,    2.92140266,    3.89520355,    4.26037888,
            4.01692866,    5.72108021,    7.06005643,    4.26037888,
          302.        ],
        [  18.16041919,  107.82748892,  101.01733172,  139.04070939,
          892.1305925 ,   13.62031439,   18.16041919,   19.86295848,
           18.72793229,   26.67311568,   32.91575977,   19.86295848,
         1408.        ],
        [   9.94437727,   59.04474002,   55.31559855,   76.13663845,
          488.51753325,    7.45828295,    9.94437727,   10.87666264,
           10.25513906,   14.60580411,   18.0241838 ,   10.87666264,
          771.        ],
        [  32.        ,  190.        ,  178.        ,  245.        ,
         1572.        ,   24.        ,   32.        ,   35.        ,
           33.        ,   47.        ,   58.        ,   35.        ,
         2481.        ]]))

In [90]:
print("The chi square stat of happiness w.r.t. earnings is {} and the p value is {}".format(x2test_ear[0],x2test_ear[1]))

The chi square stat of happiness w.r.t. earnings is 178.95053061216427 and the p value is 7.234749067043371e-21


In [87]:
hap_mar_xtab = pd.crosstab(sam_sur.happy, sam_sur.marital, margins = True)
hap_mar_xtab

happy / marital,Divorced,Married,Never married,Separated,Widowed,All
Not too happy,72,71,108,30,59,340
Pretty happy,278,684,426,49,137,1574
Very happy,93,582,120,13,83,891
All,443,1337,654,92,279,2805


In [89]:
x2test_mar = stats.chi2_contingency(observed= hap_mar_xtab)

x2test_mar

(260.68943894182826,
 7.762777322980048e-47,
 15,
 array([[  53.6969697 ,  162.06060606,   79.27272727,   11.15151515,
           33.81818182,  340.        ],
        [ 248.58538324,  750.24527629,  366.98609626,   51.62495544,
          156.55828877, 1574.        ],
        [ 140.71764706,  424.69411765,  207.74117647,   29.22352941,
           88.62352941,  891.        ],
        [ 443.        , 1337.        ,  654.        ,   92.        ,
          279.        , 2805.        ]]))

In [92]:
print("The chi square stat of happiness w.r.t. marital status is {} and the p value is {}".format(x2test_mar[0],x2test_mar[1]))

The chi square stat of happiness w.r.t. marital status is 260.68943894182826 and the p value is 7.762777322980048e-47


* Both associations are significant, but the chi-square statistic for happiness w.r.t. marital status (**260.69**) is larger than the chi-square statistic for happiness w.r.t. earnings (**178.95**). We therefore conclude that happiness is driven more by marital status than by earnings.
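
Raw chi-square statistics depend on the sample size and on the dimensions of each table, so comparing them directly is only a rough guide. One common way to put the two associations on the same scale is Cramér's V, defined as sqrt(chi2 / (n * min(r-1, c-1))). This is a swapped-in effect-size measure, not part of the original analysis; the sketch below computes it from the statistics and table sizes reported above, and the helper name `cramers_v` is hypothetical.

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V for an n_rows x n_cols contingency table with n observations."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# Statistics and totals reported above; table sizes exclude the "All" margins
v_earnings = cramers_v(178.95053061216427, n=2481, n_rows=3, n_cols=12)
v_marital = cramers_v(260.68943894182826, n=2805, n_rows=3, n_cols=5)

print("Cramér's V, happiness vs earnings: {:.3f}".format(v_earnings))
print("Cramér's V, happiness vs marital status: {:.3f}".format(v_marital))
```

On this scale marital status still shows the somewhat stronger association with happiness, which is consistent with the conclusion above.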