# Hypothesis Case Study 

About the data: the data set (cust_seg) contains information on 200
customers who were part of the last campaign conducted by the credit card
division of a major bank. The CMO would like to test the following hypotheses
based on the data.

1. Card usage has improved significantly from last year's usage, which is 50. (Hint:
Compare card usage one month after the campaign with last year's hypothesized value of 50)
2. The last campaign was successful in terms of credit card usage. (Hint: Compare the means
of card usage before and after the campaign)
3. Is there any difference between males & females in terms of credit card usage? (Hint:
Compare the means of card usage for males & females)
4. Is there any difference between segments of customers in terms of credit card usage?
(Hint: Compare the means of card usage across customer segments)
5. Is there any relation between region & segment? (Hint: Find the relationship
between the categorical variables region and segment)
6. Is there a relationship between card usage in the latest month and usage before the campaign?

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("/Users/rahulmeena/Imarticus/GitHub/IMARTICUS/Datasets/class_exercise/cust_seg.csv")

In [3]:
df.head()

Unnamed: 0,custid,sex,AqChannel,region,Marital_status,segment,pre_usage,Post_usage_1month,Latest_mon_usage,post_usage_2ndmonth
0,70,0,4,1,1,1,57,52,49.2,57.2
1,121,1,4,2,1,3,68,59,63.6,64.9
2,86,0,4,3,1,1,44,33,64.8,36.3
3,141,0,4,3,1,3,63,44,56.4,48.4
4,172,0,4,2,1,2,47,52,68.4,57.2


In [4]:
from scipy import stats

## **Q1** Card usage has improved significantly from last year's usage, which is 50.

In [5]:
stats.ttest_1samp(a=df["pre_usage"], popmean=50, alternative="greater")

TtestResult(statistic=3.075895518975875, pvalue=0.0011969244515674076, df=199)

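**The p-value (≈ 0.0012) is less than α, even at the stricter α = 0.01, so there is enough statistical evidence against the null hypothesis and we reject it in favour of the alternative: the mean is greater than 50. Hence average usage is significantly above last year's value. Note that the cell above tests `pre_usage`, whereas the hint for this question refers to usage one month after the campaign (`Post_usage_1month`), so the conclusion applies to the column that was actually tested.**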
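
To make the one-sample test less of a black box, the sketch below recomputes the t statistic and the one-sided p-value by hand and compares them with `stats.ttest_1samp`. The array `usage` is synthetic stand-in data (an assumption for illustration, not the cust_seg file), and the names `t_manual` and `p_manual` are hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a usage column (NOT the cust_seg data).
rng = np.random.default_rng(0)
usage = rng.normal(loc=52, scale=10, size=200)
popmean = 50  # hypothesised mean from last year

# t = (sample mean - hypothesised mean) / standard error of the mean
n = usage.size
standard_error = usage.std(ddof=1) / np.sqrt(n)
t_manual = (usage.mean() - popmean) / standard_error

# One-sided ("greater") p-value: right tail of the t distribution, n - 1 df
p_manual = stats.t.sf(t_manual, df=n - 1)

result = stats.ttest_1samp(usage, popmean=popmean, alternative="greater")
print("manual:", t_manual, p_manual)
print("scipy: ", result.statistic, result.pvalue)
```

Both lines should print the same statistic and p-value, which shows that `alternative="greater"` simply takes the right tail of the t distribution.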

## **Q2** The last campaign was successful in terms of credit card usage.

Null Hypothesis: Avg. spending before campaign = avg. spending after campaign <br>
Alternative Hypothesis: Avg. spending before campaign < avg. spending after campaign

Confidence Level: 99% => α = 0.01 (1%)

Data collection

Test: paired sample t-test

Decision Rule: if the p-value is greater than or equal to α, we fail to reject the NULL HYPOTHESIS; otherwise we reject it in favour of the ALTERNATIVE HYPOTHESIS.

In [8]:
stats.ttest_rel(a=df["pre_usage"], b=df["Latest_mon_usage"], alternative="less")

TtestResult(statistic=-17.431588891905882, pvalue=3.1680448538120343e-42, df=199)

**The p-value (≈ 3.2e-42) is far below α = 0.01, so we reject the null hypothesis in favour of the alternative: average spending before the campaign is less than average spending after it (measured here by `Latest_mon_usage`), which indicates that the campaign was successful.**
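
A paired t-test is equivalent to a one-sample t-test on the per-customer differences against a mean of 0. The sketch below illustrates this on synthetic data; `before` and `after` are made-up arrays used only for illustration, not the notebook's columns.

```python
import numpy as np
from scipy import stats

# Synthetic paired measurements for the same 200 customers (illustrative only).
rng = np.random.default_rng(1)
before = rng.normal(50, 10, size=200)
after = before + rng.normal(3, 8, size=200)

# Paired test: H1 is mean(before) < mean(after)
paired = stats.ttest_rel(before, after, alternative="less")

# Same test expressed on the differences: H1 is mean(before - after) < 0
diff = before - after
one_sample = stats.ttest_1samp(diff, popmean=0, alternative="less")

print(paired.statistic, paired.pvalue)
print(one_sample.statistic, one_sample.pvalue)
```

Because the test works on differences within each customer, it is only meaningful when both columns describe the same customers in the same row order.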

## **Q3** Is there any difference between males & females in terms of credit card usage? (Hint: Comparing means of card usage for males & females)

Null Hypothesis: Avg. spending of males = avg. spending of females <br>
Alternative Hypothesis: Avg. spending of males ≠ avg. spending of females

Confidence Level: 95% => α = 0.05 (5%)

Data collection

Test: Independent sample t-test / two-sample t-test

Decision Rule: if the p-value is greater than or equal to α, we fail to reject the NULL HYPOTHESIS; otherwise we reject it in favour of the ALTERNATIVE HYPOTHESIS.

In [9]:
male_data = df[df["sex"]==1]
female_data = df[df["sex"]==0]

In [10]:
stats.ttest_ind(a=male_data['pre_usage'], b=female_data['pre_usage'], alternative='two-sided')

TtestResult(statistic=-0.7480109580953392, pvalue=0.4553410655360075, df=198.0)

In [11]:
stats.ttest_ind(a=male_data['Post_usage_1month'], b=female_data['Post_usage_1month'], alternative='two-sided')

TtestResult(statistic=3.7340738531536797, pvalue=0.0002462546120354932, df=198.0)

In [12]:
stats.ttest_ind(a=male_data['post_usage_2ndmonth'], b=female_data['post_usage_2ndmonth'], alternative='two-sided')

TtestResult(statistic=3.7340738531536926, pvalue=0.00024625461203548154, df=198.0)

In [13]:
stats.ttest_ind(a=male_data['Latest_mon_usage'], b=female_data['Latest_mon_usage'], alternative='two-sided')

TtestResult(statistic=-0.4129986492968787, pvalue=0.680054497423219, df=198.0)

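**At α = 0.05, the p-value is greater than α for pre-campaign usage (≈ 0.455) and for the latest month (≈ 0.680), so we fail to reject the null hypothesis there: no significant difference between males and females. For usage one month and two months after the campaign, the p-value (≈ 0.00025) is less than α, so we reject the null hypothesis: mean usage differs between males and females in those periods.**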
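
The calls above use the default `equal_var=True`, which pools the variances of the two groups (hence df = 198). A possible robustness check, sketched below on synthetic data, is to test the equal-variance assumption with Levene's test and to compare the pooled result with Welch's t-test (`equal_var=False`). The arrays `group_a` and `group_b` are illustrative assumptions, not the male and female data.

```python
import numpy as np
from scipy import stats

# Two synthetic groups with different sizes and spreads (illustrative only).
rng = np.random.default_rng(2)
group_a = rng.normal(55, 8, size=90)
group_b = rng.normal(52, 12, size=110)

# Levene's test: H0 is that both groups have equal variance.
levene = stats.levene(group_a, group_b)

# Pooled-variance t-test (the default) versus Welch's t-test.
pooled = stats.ttest_ind(group_a, group_b, equal_var=True)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Welch statistic by hand: difference of means over the unpooled standard error.
se_unpooled = np.sqrt(group_a.var(ddof=1) / group_a.size
                      + group_b.var(ddof=1) / group_b.size)
t_welch_manual = (group_a.mean() - group_b.mean()) / se_unpooled

print("Levene p-value:", levene.pvalue)
print("pooled:", pooled.statistic, pooled.pvalue)
print("Welch: ", welch.statistic, welch.pvalue)
```

If Levene's test suggests unequal variances, the Welch result is usually the safer one to report.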

## **Q4** Is there any difference between segments of customers in terms of credit card usage?

Null Hypothesis: Avg. spend of segment 1 = segment 2 = segment 3

Alternative Hypothesis: At least one segment has a different mean spend

Confidence Level: 95% => α = 0.05 (5%)

Data collection

Test: one-way ANOVA

Decision Rule: if the p-value is greater than or equal to α, we fail to reject the NULL HYPOTHESIS; otherwise we reject it in favour of the ALTERNATIVE HYPOTHESIS.

In [18]:
df["segment"].value_counts()

segment
2    105
3     50
1     45
Name: count, dtype: int64

In [16]:
# 1 - Classic - (1L - 2L)
# 2 - Preferred - (2L - 5L)
# 3 - Imperia - (5L - above)

In [None]:
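groups = [g["Latest_mon_usage"] for _, g in df.groupby("segment")]  # usage column is an assumption
stats.f_oneway(*groups)  # one usage sample per segment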
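
`stats.f_oneway` needs one sample per group as separate arguments. The sketch below shows what it computes, the ratio of between-group to within-group mean squares, on synthetic data. The group sizes mirror the `value_counts()` output above, but the values themselves are made up for illustration.

```python
import numpy as np
from scipy import stats

# Three synthetic segments with sizes 45, 105 and 50 (values are illustrative).
rng = np.random.default_rng(3)
groups = [
    rng.normal(50, 10, size=45),
    rng.normal(53, 10, size=105),
    rng.normal(56, 10, size=50),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = all_values.size    # total number of observations

# Between-group and within-group sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual = stats.f.sf(f_manual, k - 1, n - k)

result = stats.f_oneway(*groups)
print(f_manual, p_manual)
print(result.statistic, result.pvalue)
```

A significant F only says that at least one segment mean differs; a post hoc comparison would be needed to say which one.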

## **Q5** Is there any relation between region & Segment? 

Null Hypothesis: No relation between region and segment

Alternative Hypothesis: There is a relation between region and segment

Confidence Level: 95% => α = 0.05 (5%)

Data collection

Test: Chi-square test of independence

Decision Rule: if the p-value is greater than or equal to α, we fail to reject the NULL HYPOTHESIS; otherwise we reject it in favour of the ALTERNATIVE HYPOTHESIS.

In [19]:
ob = pd.crosstab(df["region"], df["segment"])
ob

region \ segment,1,2,3
1,16,19,12
2,20,44,31
3,9,42,7

In [20]:
stats.chi2_contingency(ob)

Chi2ContingencyResult(statistic=16.604441649489342, pvalue=0.0023066300908054713, dof=4, expected_freq=array([[10.575, 24.675, 11.75 ],
       [21.375, 49.875, 23.75 ],
       [13.05 , 30.45 , 14.5  ]]))

**The p-value (≈ 0.0023) is less than α = 0.05, so we reject the null hypothesis: there is a statistically significant relation between region and segment.**

## **Q6** Is there a relationship between card usage in the latest month and usage before the campaign?

Null Hypothesis: No relation

Alternative Hypothesis: There is a relation

Confidence Level: 95% => α = 0.05 (5%)

Data collection

Test: Pearson correlation

Decision Rule: if the p-value is greater than or equal to α, we fail to reject the NULL HYPOTHESIS; otherwise we reject it in favour of the ALTERNATIVE HYPOTHESIS.

In [22]:
# Pearson correlation test (stats.stats is a deprecated namespace; call stats.pearsonr directly)
stats.pearsonr(df["pre_usage"], df["Latest_mon_usage"])


PearsonRResult(statistic=0.6622801251558603, pvalue=1.2767419295068642e-26)
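
The p-value reported by `pearsonr` tests the null hypothesis that the true correlation is zero. As a rough sketch of where it comes from, the code below computes r by hand and converts it to a t statistic with n - 2 degrees of freedom. The arrays `x` and `y` are synthetic (an assumption for illustration), not the notebook's columns.

```python
import numpy as np
from scipy import stats

# Synthetic, weakly related variables (illustrative only).
rng = np.random.default_rng(4)
x = rng.normal(50, 10, size=200)
y = 0.2 * x + rng.normal(0, 10, size=200)

# Pearson r: co-variation divided by the product of the individual spreads
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Under H0 (no correlation), t = r * sqrt((n - 2) / (1 - r^2)) follows a
# t distribution with n - 2 degrees of freedom.
n = x.size
t_stat = r_manual * np.sqrt((n - 2) / (1 - r_manual ** 2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided

r, p = stats.pearsonr(x, y)
print(r_manual, p_manual)
print(r, p)
```

Pearson's r only measures linear association, so a scatter plot is a sensible companion to the test.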

In [23]:
df[["pre_usage", "Latest_mon_usage"]].corr()

Unnamed: 0,pre_usage,Latest_mon_usage
pre_usage,1.0,0.66228
Latest_mon_usage,0.66228,1.0

**The p-value (≈ 1.3e-26) is less than α = 0.05, so we reject the null hypothesis: there is a statistically significant positive linear relationship (r ≈ 0.66) between pre-campaign usage and latest-month usage.**
