# Basic Statistics Case Study

## BUSINESS PROBLEM - 1

Using Lending Club loan data, the team would like to test the hypotheses below about how 
different factors affect each other (Hint: you may use statistical hypothesis tests).

a. Interest rate varies with loan amount (less interest is charged for higher loan 
amounts).

b. Loan length directly affects interest rate.

c. Interest rate varies by loan purpose.

d. There is a relationship between FICO scores and home ownership; that is, people 
who own their homes have higher FICO scores.

### Import necessary libraries

In [1]:
import pandas as pd
import scipy.stats as stats

### Import the data set

In [2]:
loans = pd.read_csv('LoansData.csv')
loans.head()

Unnamed: 0,Amount.Requested,Amount.Funded.By.Investors,Interest.Rate,Loan.Length,Loan.Purpose,Debt.To.Income.Ratio,State,Home.Ownership,Monthly.Income,FICO.Range,Open.CREDIT.Lines,Revolving.CREDIT.Balance,Inquiries.in.the.Last.6.Months,Employment.Length
0,20000.0,20000.0,8.90%,36 months,debt_consolidation,14.90%,SC,MORTGAGE,6541.67,735-739,14.0,14272.0,2.0,< 1 year
1,19200.0,19200.0,12.12%,36 months,debt_consolidation,28.36%,TX,MORTGAGE,4583.33,715-719,12.0,11140.0,1.0,2 years
2,35000.0,35000.0,21.98%,60 months,debt_consolidation,23.81%,CA,MORTGAGE,11500.0,690-694,14.0,21977.0,1.0,2 years
3,10000.0,9975.0,9.99%,36 months,debt_consolidation,14.30%,KS,MORTGAGE,3833.33,695-699,10.0,9346.0,0.0,5 years
4,12000.0,12000.0,11.71%,36 months,credit_card,18.78%,NJ,RENT,3195.0,695-699,11.0,14469.0,0.0,9 years


### Exploratory Data Analysis

In [3]:
print(loans.shape)

(2500, 14)


In [4]:
loans.dtypes

Amount.Requested                  float64
Amount.Funded.By.Investors        float64
Interest.Rate                      object
Loan.Length                        object
Loan.Purpose                       object
Debt.To.Income.Ratio               object
State                              object
Home.Ownership                     object
Monthly.Income                    float64
FICO.Range                         object
Open.CREDIT.Lines                 float64
Revolving.CREDIT.Balance          float64
Inquiries.in.the.Last.6.Months    float64
Employment.Length                  object
dtype: object

In [5]:
loans['Interest.Rate'] = loans['Interest.Rate'].str.replace('%', '').astype(float)
loans['Loan.Length'] = loans['Loan.Length'].str.replace('months', '').astype(float)
loans['Debt.To.Income.Ratio'] = loans['Debt.To.Income.Ratio'].str.replace('%', '').astype(float)
loans.dtypes

Amount.Requested                  float64
Amount.Funded.By.Investors        float64
Interest.Rate                     float64
Loan.Length                       float64
Loan.Purpose                       object
Debt.To.Income.Ratio              float64
State                              object
Home.Ownership                     object
Monthly.Income                    float64
FICO.Range                         object
Open.CREDIT.Lines                 float64
Revolving.CREDIT.Balance          float64
Inquiries.in.the.Last.6.Months    float64
Employment.Length                  object
dtype: object

In [6]:
loans.isna().sum()

Amount.Requested                   1
Amount.Funded.By.Investors         1
Interest.Rate                      2
Loan.Length                        0
Loan.Purpose                       0
Debt.To.Income.Ratio               1
State                              0
Home.Ownership                     1
Monthly.Income                     1
FICO.Range                         2
Open.CREDIT.Lines                  3
Revolving.CREDIT.Balance           3
Inquiries.in.the.Last.6.Months     3
Employment.Length                 77
dtype: int64

### (a) Interest rate varies with loan amount (less interest is charged for higher loan amounts).

In [7]:
# H0: Interest rate and loan amount are not correlated.
# Ha: Interest rate and loan amount are correlated.

# Significance level: 0.05 (95% confidence)

loans_ = loans[['Interest.Rate', 'Amount.Requested']].dropna()

# Pearson correlation between the two variables. The p-value returned by
# pearsonr is discarded here; the decision below uses a |r| >= 0.5 rule of
# thumb for a "strong" correlation rather than a formal significance test.

corr, _ = stats.pearsonr(loans_['Interest.Rate'], loans_['Amount.Requested'])
print('Correlation coefficient:', corr)

if abs(corr) >= 0.5:
    print('Reject Null Hypothesis. There is a strong correlation between Interest.Rate & Amount.Requested')
else:
    print('Retain Null Hypothesis. There is no strong correlation between Interest.Rate & Amount.Requested')

Correlation coefficient: 0.3324537662008248
Retain Null Hypothesis. There is no strong correlation between Interest.Rate & Amount.Requested
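The cell above bases its decision on the size of the correlation coefficient. A formal test of H0 would instead compare the p-value returned by `stats.pearsonr` against the significance level. The sketch below shows that pattern on synthetic data (the arrays `amount` and `rate` are made up for illustration and are not the lending club data), and also reports the sign of the coefficient, which is what the "less interest for high loan amounts" claim depends on.

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-in data (assumption): rates that loosely rise with the amount.
rng = np.random.default_rng(0)
amount = rng.uniform(1_000, 35_000, size=500)
rate = 8 + 0.0002 * amount + rng.normal(0, 4, size=500)

alpha = 0.05  # significance level
r, p_val = stats.pearsonr(rate, amount)
print(f"r = {r:.3f}, p-value = {p_val:.3g}")

if p_val < alpha:
    # The sign of r tells us whether rates rise or fall with the amount.
    direction = "positive" if r > 0 else "negative"
    print(f"Reject H0: significant {direction} linear correlation.")
else:
    print("Retain H0: no significant linear correlation detected.")
```

With a large sample, even a modest coefficient can be statistically significant, so it helps to report both the p-value and the effect size.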


### (b) Loan length directly affects interest rate.

In [8]:
# H0: There is no relationship between loan length and interest rate.
# Ha: There is a relationship between loan length and interest rate.

# Significance level: 0.05 (95% confidence)

loans_ = loans[['Interest.Rate', 'Loan.Length']].dropna()

# Pearson correlation between the two variables. The p-value returned by
# pearsonr is discarded here; the decision below uses a |r| >= 0.5 rule of
# thumb for a "strong" correlation rather than a formal significance test.

corr, _ = stats.pearsonr(loans_['Interest.Rate'], loans_['Loan.Length'])
print('Correlation coefficient:', corr)

if abs(corr) >= 0.5:
    print('Reject Null Hypothesis. There is a strong correlation between Interest.Rate & Loan.Length')
else:
    print('Retain Null Hypothesis. There is no strong correlation between Interest.Rate & Loan.Length')

Correlation coefficient: 0.42421960276133236
Retain Null Hypothesis. There is no strong correlation between Interest.Rate & Loan.Length
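Since `Loan.Length` appears to take only a few distinct values (36 and 60 months in the rows shown earlier), the same question can also be framed as a comparison of mean interest rates between loan terms. Below is a minimal sketch using Welch's two-sample t-test; the arrays `rate_36` and `rate_60` are synthetic placeholders, not values from `LoansData.csv`.

```python
import numpy as np
import scipy.stats as stats

# Synthetic interest rates (assumption) for two hypothetical loan terms.
rng = np.random.default_rng(1)
rate_36 = rng.normal(12.0, 4.0, size=300)
rate_60 = rng.normal(16.0, 4.0, size=100)

alpha = 0.05
# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, p_val = stats.ttest_ind(rate_60, rate_36, equal_var=False)
mean_diff = rate_60.mean() - rate_36.mean()

print(f"mean difference (60 - 36 months) = {mean_diff:.2f}, p-value = {p_val:.3g}")
if p_val < alpha:
    print("Reject H0: mean interest rate differs between loan terms.")
else:
    print("Retain H0: no significant difference between loan terms.")
```

On the real data, the two groups could be obtained with `loans.groupby('Loan.Length')['Interest.Rate']` after dropping missing values.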


### (c) Interest rate varies by loan purpose

In [9]:
# H0: Interest rate does not vary across loan purposes.
# Ha: Interest rate varies across loan purposes.

# Significance level: 0.05 (95% confidence)

loans_ = loans[['Interest.Rate', 'Loan.Purpose']].dropna()

# Spearman rank correlation between the two variables.
# Caveat: Loan.Purpose is a nominal category, so ranking its labels imposes an
# arbitrary order; a group comparison such as one-way ANOVA (stats.f_oneway)
# or Kruskal-Wallis (stats.kruskal) is better suited to this question.

corr, p_val = stats.spearmanr(loans_['Interest.Rate'], loans_['Loan.Purpose'])
print('p-value:', p_val)

if p_val < 0.05:
    print('Reject Null Hypothesis. Interest rate varies for different purpose of loans')
else:
    print('Retain Null Hypothesis. Interest rate does not vary for different purpose of loans')

p-value: 0.002520113713544295
Reject Null Hypothesis. Interest rate varies for different purpose of loans
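Because loan purpose is a categorical variable, comparing interest rates across purpose groups is a natural alternative to a rank correlation. The sketch below runs a one-way ANOVA and a Kruskal-Wallis test on a small synthetic frame; `demo` and its values are invented for illustration, and only the column names mirror the loans data.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Synthetic frame (assumption) with three purposes and different mean rates.
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "Loan.Purpose": np.repeat(["car", "credit_card", "debt_consolidation"], 80),
    "Interest.Rate": np.concatenate([
        rng.normal(11.0, 3.0, 80),
        rng.normal(13.0, 3.0, 80),
        rng.normal(14.0, 3.0, 80),
    ]),
})

# One array of interest rates per loan purpose.
groups = [g["Interest.Rate"].to_numpy() for _, g in demo.groupby("Loan.Purpose")]

f_stat, p_anova = stats.f_oneway(*groups)    # compares group means
h_stat, p_kruskal = stats.kruskal(*groups)   # rank-based, no normality assumption

print(f"ANOVA p-value: {p_anova:.3g}")
print(f"Kruskal-Wallis p-value: {p_kruskal:.3g}")
```

Both tests only say whether at least one group differs; they do not identify which purposes differ from each other.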


### (d) There is a relationship between FICO scores and home ownership; that is, people who own their homes have higher FICO scores.

In [10]:
# H0: There is no relationship between FICO scores and home ownership.
# Ha: There is a relationship between FICO scores and home ownership.

# Significance level: 0.05 (95% confidence)

# Contingency table of counts: home ownership category x FICO range
crosstab = pd.crosstab(loans['Home.Ownership'], loans['FICO.Range'])

# Chi-square test of independence on the contingency table.
# It tests for association only; it does not show which group scores higher.

chi_test = stats.chi2_contingency(crosstab)

print("The chi-square stat is {} and the p-value is {}".format(chi_test[0],chi_test[1]))

if chi_test[1] > 0.05:
    print('Null Hypothesis retained. There is no relationship between FICO Scores and Home Ownership.')
else:
    print('Null Hypothesis rejected. There is a relationship between FICO Scores and Home Ownership.')

The chi-square stat is 473.0524636834602 and the p-value is 1.202159201024428e-35
Null Hypothesis rejected. There is a relationship between FICO Scores and Home Ownership.
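The chi-square test establishes an association but not its direction. To check whether home owners actually tend to have higher scores, one option is to convert each FICO range to a number and compare group averages. The sketch below does this on a tiny hypothetical frame; the rows in `demo` and the helper column `FICO.Low` are illustrative assumptions, not part of the original data set.

```python
import pandas as pd

# Hypothetical rows (assumption) in the same format as the loans data.
demo = pd.DataFrame({
    "Home.Ownership": ["MORTGAGE", "MORTGAGE", "RENT", "RENT", "OWN", "OWN"],
    "FICO.Range": ["735-739", "715-719", "695-699", "670-674", "700-704", "720-724"],
})

# Use the lower bound of each range (e.g. "735-739" -> 735) as a numeric score.
demo["FICO.Low"] = demo["FICO.Range"].str.split("-").str[0].astype(int)

# Average score per ownership category, highest first.
mean_fico = (
    demo.groupby("Home.Ownership")["FICO.Low"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_fico)
```

With a numeric score available, a group comparison such as `stats.kruskal` could then be used to test whether the differences between ownership categories are significant.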


## BUSINESS PROBLEM - 2

Two randomly selected pricing experts, Mary and Barry, were asked to independently provide prices for twelve randomly selected orders. Each expert provided one price for each of the twelve orders. We would like to assess if there is any difference in the average price quotes provided by Mary and Barry.

### Import the data set

In [11]:
price_quotes = pd.read_csv('Price_Quotes.csv')
price_quotes

Unnamed: 0,Order_Number,Barry_Price,Mary_Price
0,1,126,114
1,2,110,118
2,3,138,114
3,4,142,111
4,5,146,129
5,6,136,119
6,7,94,97
7,8,103,104
8,9,140,127
9,10,152,133


In [12]:
price_quotes.dtypes

Order_Number    int64
Barry_Price     int64
Mary_Price      int64
dtype: object

In [13]:
# H0: There is no difference in the average price quotes provided by Mary and Barry.
# Ha: There is a difference in the average price quotes provided by Mary and Barry.

# Significance level: 0.05 (95% confidence)

# Both experts priced the same orders, so the samples are paired:
# use the paired t-test (ttest_rel).

t_stat, p_val = stats.ttest_rel(price_quotes['Mary_Price'], price_quotes['Barry_Price'])
print(p_val)

if p_val > 0.05:
    print('Retain null hypothesis. There is no difference in price quotes provided by Mary and Barry.')
else:
    print('Reject null hypothesis. There is a difference in price quotes provided by Mary and Barry.')

0.02840588045242053
Reject null hypothesis. There is a difference in price quotes provided by Mary and Barry.
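A paired t-test is equivalent to a one-sample t-test on the per-order differences, which also makes the direction of the difference easy to read. The sketch below illustrates this using only the ten rows displayed above (the full file may contain more orders, so the p-value here need not match the one printed by the notebook).

```python
import numpy as np
import scipy.stats as stats

# The ten rows shown in the table above; not necessarily the full data set.
barry = np.array([126, 110, 138, 142, 146, 136, 94, 103, 140, 152])
mary = np.array([114, 118, 114, 111, 129, 119, 97, 104, 127, 133])

# Per-order difference: negative values mean Mary quoted less than Barry.
diff = mary - barry

t_rel, p_rel = stats.ttest_rel(mary, barry)
t_one, p_one = stats.ttest_1samp(diff, popmean=0)

print("mean difference (Mary - Barry):", diff.mean())
print("paired test and one-sample test agree:", np.isclose(p_rel, p_one))
```

Reporting the mean difference alongside the p-value shows not only that the quotes differ but also which expert tends to quote higher.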


## BUSINESS PROBLEM - 3

Determine what effect, if any, the reengineering effort had on the incidence of behavioral problems and staff turnover, i.e. determine whether the reengineering effort changed the critical incidence rate. Is there evidence that the critical incidence rate improved?

### Import the data set

In [14]:
treatment_facility = pd.read_csv('Treatment_Facility.csv')
treatment_facility

Unnamed: 0,Month,Reengineer,Employee_Turnover,VAR4,VAR5
0,1,Prior,0.0,24.390244,42.682927
1,2,Prior,6.0606,19.354839,25.806452
2,3,Prior,12.1212,35.087719,146.19883
3,4,Prior,3.3333,18.404908,110.429448
4,5,Prior,12.9032,17.964072,23.952096
5,6,Prior,9.6774,41.176471,47.058824
6,7,Prior,11.7647,13.422819,0.0
7,8,Prior,11.4286,31.25,25.0
8,9,Prior,23.0769,17.241379,132.183908
9,10,Prior,15.0,16.574586,16.574586


In [15]:
## Check whether critical incidence differs prior to and post reengineering

# H0: Critical incidence is the same prior to and post reengineering.
# Ha: Critical incidence differs prior to and post reengineering.

# Significance level: 0.05 (95% confidence)

# ttest_rel needs two samples of equal length, so only the first 7 values of each group are kept.
prior = treatment_facility[treatment_facility['Reengineer'] == 'Prior']['VAR5'][:7]
post = treatment_facility[treatment_facility['Reengineer'] == 'Post']['VAR5'][:7]

# Paired t-test (ttest_rel). Caveat: the prior and post months are not natural
# pairs, so an independent two-sample test (stats.ttest_ind) may be a better fit.

t_stat, p_val = stats.ttest_rel(prior, post)
print(p_val)

if p_val > 0.05:
    print('Retain null hypothesis. There is no difference in critical incidence prior and post reengineering.')
else:
    print('Reject null hypothesis. There is a difference in critical incidence prior and post reengineering.')

0.12146440389514208
Retain null hypothesis. There is no difference in critical incidence prior and post reengineering.


In [16]:
## Check whether employee turnover differs prior to and post reengineering

# H0: Employee turnover is the same prior to and post reengineering.
# Ha: Employee turnover differs prior to and post reengineering.

# Significance level: 0.05 (95% confidence)

# ttest_rel needs two samples of equal length, so only the first 7 values of each group are kept.
prior = treatment_facility[treatment_facility['Reengineer'] == 'Prior']['Employee_Turnover'][:7]
post = treatment_facility[treatment_facility['Reengineer'] == 'Post']['Employee_Turnover'][:7]

# Paired t-test (ttest_rel). Caveat: the prior and post months are not natural
# pairs, so an independent two-sample test (stats.ttest_ind) may be a better fit.

t_stat, p_val = stats.ttest_rel(prior, post)
print(p_val)

if p_val > 0.05:
    print('Retain null hypothesis. Employee turnover is same prior and post reengineering.')
else:
    print('Reject null hypothesis. Employee turnover is not the same prior and post reengineering.')

0.07527474813730192
Retain null hypothesis. Employee turnover is same prior and post reengineering.
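The question asks whether the rate improved, which is a one-sided hypothesis, and the prior and post periods are separate sets of months rather than matched pairs. One possible alternative is therefore an independent (Welch) t-test with a one-sided alternative, which also avoids truncating the longer group. The sketch below uses synthetic samples; the group sizes and values are assumptions, and the `alternative` argument requires SciPy 1.6 or later.

```python
import numpy as np
import scipy.stats as stats

# Synthetic monthly rates (assumption): unequal group sizes, no pairing needed.
rng = np.random.default_rng(3)
prior = rng.normal(60.0, 30.0, size=13)
post = rng.normal(25.0, 15.0, size=7)

# Two-sided: is there any difference between the periods?
t_stat, p_two_sided = stats.ttest_ind(prior, post, equal_var=False)

# One-sided: was the prior rate greater than the post rate (i.e. an improvement)?
_, p_one_sided = stats.ttest_ind(
    prior, post, equal_var=False, alternative="greater"
)

print(f"t = {t_stat:.3f}")
print(f"two-sided p-value: {p_two_sided:.3g}")
print(f"one-sided p-value (prior > post): {p_one_sided:.3g}")
```

When the observed difference is in the hypothesized direction, the one-sided p-value is half the two-sided one, so the choice of alternative should be made before looking at the data.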


## BUSINESS PROBLEM - 4

We will focus on the prioritization system. If the system is working, then high priority jobs, on average, 
should be completed more quickly than medium priority jobs, and medium priority jobs should be completed more quickly 
than low priority jobs. Use the data provided to determine whether this is, in fact, occurring.

### Import the data set

In [17]:
priority_assessment = pd.read_csv('Priority_Assessment.csv')
priority_assessment.head()

Unnamed: 0,Days,Priority
0,3.3,High
1,7.9,Medium
2,0.3,High
3,0.7,Medium
4,8.6,Medium


In [18]:
priority_assessment.dtypes

Days        float64
Priority     object
dtype: object

In [19]:
# H0: High, medium and low priority jobs take the same average time.
# Ha: At least one priority level has a different average time.

# Significance level: 0.05 (95% confidence)

high = priority_assessment[priority_assessment['Priority'] == 'High']['Days']
medium = priority_assessment[priority_assessment['Priority'] == 'Medium']['Days']
low = priority_assessment[priority_assessment['Priority'] == 'Low']['Days']

# One-way ANOVA (f_oneway) compares the means of the three groups.

f_stat, p_val = stats.f_oneway(high, medium, low)
print(p_val)

if p_val > 0.05:
    print('Retain null hypothesis. High, medium and low priority jobs are taking almost same average time.')
else:
    print('Reject null hypothesis. High, medium and low priority jobs are taking different average time.')

0.16411459461716182
Retain null hypothesis. High, medium and low priority jobs are taking almost same average time.
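ANOVA only tests whether the group means differ at all, while the business question is about a specific ordering (High faster than Medium, Medium faster than Low). One way to probe the ordering directly is a pair of one-sided Welch t-tests with a Bonferroni-adjusted threshold. The sketch below uses synthetic completion times; the `days` dictionary is an assumption for illustration, and the `alternative` argument requires SciPy 1.6 or later.

```python
import numpy as np
import scipy.stats as stats

# Synthetic completion times in days (assumption), one array per priority.
rng = np.random.default_rng(4)
days = {
    "High": rng.exponential(3.0, size=60),
    "Medium": rng.exponential(4.0, size=60),
    "Low": rng.exponential(5.0, size=60),
}

alpha = 0.05
pairs = [("High", "Medium"), ("Medium", "Low")]
results = {}

for faster, slower in pairs:
    # Ha: the higher-priority group has a smaller mean completion time.
    _, p = stats.ttest_ind(
        days[faster], days[slower], equal_var=False, alternative="less"
    )
    results[(faster, slower)] = p
    # Bonferroni correction: divide alpha by the number of comparisons.
    verdict = "supported" if p < alpha / len(pairs) else "not supported"
    print(f"{faster} faster than {slower}: p = {p:.3g} ({verdict})")
```

Printing the group means (for example with `priority_assessment.groupby('Priority')['Days'].mean()`) is a useful companion check on the real data.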


## BUSINESS PROBLEM - 5

Use the survey results to address the following questions:

- What is the overall level of customer satisfaction?

- What factors are linked to satisfaction?

- What is the demographic profile of Film on the Rocks patrons?

- In what media outlet(s) should the film series be advertised?

### Import the data set

In [30]:
films = pd.read_csv('Films.csv')
films.head()

Unnamed: 0,_rowstate_,Movie,Gender,Marital_Status,Sinage,Parking,Clean,Overall,Age,Income,Hear_About
0,0,Ferris Buellers Day Off,Female,Married,2.0,2.0,2.0,2.0,3.0,1.0,5
1,0,Ferris Buellers Day Off,Female,Single,1.0,1.0,1.0,1.0,2.0,1.0,5
2,0,Ferris Buellers Day Off,Male,Married,2.0,4.0,3.0,2.0,4.0,1.0,5
3,0,Ferris Buellers Day Off,Female,Married,1.0,3.0,2.0,2.0,4.0,1.0,5
4,0,Ferris Buellers Day Off,Female,Married,1.0,1.0,1.0,1.0,3.0,3.0,1


In [46]:
films.loc[films['Gender'] == '1', 'Gender'] = 'Male'
films.loc[films['Gender'] == '2', 'Gender'] = 'Female'
films.loc[films['Marital_Status'] == '1', 'Marital_Status'] = 'Married'
films.loc[films['Marital_Status'] == '2', 'Marital_Status'] = 'Single'
films.head()

Unnamed: 0,_rowstate_,Movie,Gender,Marital_Status,Sinage,Parking,Clean,Overall,Age,Income,Hear_About
0,0,Ferris Buellers Day Off,Female,Married,2.0,2.0,2.0,2.0,3.0,1.0,5
1,0,Ferris Buellers Day Off,Female,Single,1.0,1.0,1.0,1.0,2.0,1.0,5
2,0,Ferris Buellers Day Off,Male,Married,2.0,4.0,3.0,2.0,4.0,1.0,5
3,0,Ferris Buellers Day Off,Female,Married,1.0,3.0,2.0,2.0,4.0,1.0,5
4,0,Ferris Buellers Day Off,Female,Married,1.0,1.0,1.0,1.0,3.0,3.0,1


In [47]:
films.dtypes

_rowstate_          int64
Movie              object
Gender             object
Marital_Status     object
Sinage            float64
Parking           float64
Clean             float64
Overall           float64
Age               float64
Income            float64
Hear_About         object
dtype: object
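As a possible starting point for the first two questions, the sketch below summarizes the distribution of the `Overall` rating and computes Spearman rank correlations between `Overall` and the other rating columns. It uses only the five rows displayed above, so the numbers are illustrative; it also assumes the ratings are ordinal codes and makes no assumption about which end of the scale means "satisfied".

```python
import pandas as pd
import scipy.stats as stats

# The five rows shown above (rating columns only); not the full survey.
demo = pd.DataFrame({
    "Sinage":  [2.0, 1.0, 2.0, 1.0, 1.0],
    "Parking": [2.0, 1.0, 4.0, 3.0, 1.0],
    "Clean":   [2.0, 1.0, 3.0, 2.0, 1.0],
    "Overall": [2.0, 1.0, 2.0, 2.0, 1.0],
})

# Share of respondents giving each overall rating.
overall_share = demo["Overall"].value_counts(normalize=True).sort_index()
print(overall_share)

# Rank correlation of each factor with the overall rating.
rank_corr = {}
for col in ["Sinage", "Parking", "Clean"]:
    rho, p_val = stats.spearmanr(demo[col], demo["Overall"])
    rank_corr[col] = rho
    print(f"{col}: rho = {rho:.3f}, p-value = {p_val:.3g}")
```

On the full data, missing ratings would need to be dropped first, for example with `films[['Sinage', 'Parking', 'Clean', 'Overall']].dropna()`.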