In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.2f' % x)
import scipy.stats as stats
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('Lok+Sabha_2019.csv')

In [3]:
df.head()

Unnamed: 0,STATE,CONSTITUENCY,NAME,WINNER,PARTY,SYMBOL,GENDER,CRIMINAL CASES,AGE,CATEGORY,EDUCATION,ASSETS,LIABILITIES,GENERAL VOTES,POSTAL VOTES,TOTAL VOTES,OVER TOTAL ELECTORS IN CONSTITUENCY,OVER TOTAL VOTES POLLED IN CONSTITUENCY,TOTAL ELECTORS
0,Telangana,ADILABAD,SOYAM BAPU RAO,1,BJP,Lotus,MALE,52.0,52,ST,Basic Level,3099414.0,231450.0,376892,482,377374,25.33,35.47,1489790
1,Telangana,ADILABAD,Godam Nagesh,0,TRS,Car,MALE,0.0,54,ST,Post Graduate,18477888.0,847000.0,318665,149,318814,21.4,29.96,1489790
2,Telangana,ADILABAD,RATHOD RAMESH,0,INC,Hand,MALE,3.0,52,ST,Basic Level,36491000.0,15300000.0,314057,181,314238,21.09,29.53,1489790
3,Uttar Pradesh,AGRA,Satyapal Singh Baghel,1,BJP,Lotus,MALE,5.0,58,SC,Doctorate,74274036.0,8606522.0,644459,2416,646875,33.38,56.46,1937690
4,Uttar Pradesh,AGRA,Manoj Kumar Soni,0,BSP,Elephant,MALE,0.0,47,SC,Post Graduate,133784385.0,22251891.0,434199,1130,435329,22.47,38.0,1937690


# 3.1. After the elections, the Association for Democratic Reforms (ADR) is responsible for analysing the election data. In the 2014 elections, it observed that the average assets declared by candidates from the state of Bihar were around 5 crore. Before the 2019 elections, it claimed that this average would be greater than the 2014 figure for the Bihar candidates. State the null and alternative hypotheses. Perform an appropriate statistical test to verify the claim made by ADR. Decide whether the null hypothesis is supported or refuted. (Use alpha = 0.05.)

### Solution:
H0: Avg. Assets for a Bihar candidate <= 5 crore

H1: Avg. Assets for a Bihar candidate > 5 crore

In [4]:
tstat,pvalue = stats.ttest_1samp(np.array(df[df['STATE']=='Bihar']['ASSETS'].dropna()),50000000)

In [5]:
pvalue

0.30884189228789083

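By default, "ttest_1samp" returns the p-value of a two-sided t-test (ref: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html). Our H1 is one-sided (greater), and the t-statistic is positive, i.e. the sample mean lies above 5 crore, so the one-sided p-value is half of the two-sided value. If the t-statistic were negative, the one-sided p-value would instead be 1 - pvalue/2.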

In [6]:
print('So, the p-value will be:',pvalue/2)

So, the p-value will be: 0.15442094614394541
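The halving step can also be left to SciPy. The sketch below is only an illustration on a synthetic sample (the values and the names `sample` and `popmean` are assumptions, not the Lok Sabha data). It assumes SciPy 1.6 or newer, where `ttest_1samp` accepts an `alternative` argument.

```python
import numpy as np
import scipy.stats as stats

# Synthetic, hypothetical sample used only to illustrate the API.
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.5, scale=2.0, size=40)
popmean = 5.0

# Default call: two-sided p-value.
t_two, p_two = stats.ttest_1samp(sample, popmean)

# One-sided call for H1: mean > popmean (needs SciPy >= 1.6).
t_one, p_one = stats.ttest_1samp(sample, popmean, alternative='greater')

# Manual conversion: halve only when the statistic points in the direction of H1.
p_halved = p_two / 2 if t_two > 0 else 1 - p_two / 2

print('one-sided p-value from SciPy:', p_one)
print('one-sided p-value by hand   :', p_halved)
```

Both printed values should agree, which is why dividing by 2 is acceptable here as long as the sign of the t-statistic is checked first.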


#### Alternative method

In [7]:
mu = 50000000  # hypothesised mean: 5 crore
bihar_assets = np.array(df[df['STATE']=='Bihar']['ASSETS'].dropna())
xbar = bihar_assets.mean()
samp_std_dev = np.std(bihar_assets,ddof = 1)  # sample standard deviation
n = len(bihar_assets)
t_stat = (xbar - mu) / (samp_std_dev/np.sqrt(n))
# t_stat is already standardised, so the critical value must not be shifted by loc
t_crit = stats.t.isf(0.05,n-1)
pvalue = stats.t.sf(t_stat,n-1)  # one-sided (upper-tail) p-value
alpha = stats.t.sf(t_crit,n-1)  # sanity check: recovers 0.05

In [8]:
pvalue

0.15442094614394541

As we can see, the p-value (0.154) is > 0.05. So, we do not have enough evidence to conclude that the average assets of a Bihar candidate are greater than 5 crore. Hence, we fail to reject the null hypothesis.
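The same decision can be reached by comparing the t-statistic with a critical value instead of comparing the p-value with alpha. The following is a minimal sketch on synthetic data (the sample and the names `mu0`, `reject_by_p`, and `reject_by_crit` are illustrative assumptions, not part of the analysis above).

```python
import numpy as np
import scipy.stats as stats

# Synthetic, hypothetical sample; NOT the Bihar assets data.
rng = np.random.default_rng(1)
sample = rng.normal(loc=5.2, scale=2.0, size=30)
mu0 = 5.0      # hypothesised mean under H0
alpha = 0.05   # significance level
n = sample.size

# Standardised test statistic for a one-sample t-test.
t_stat = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

# Upper-tail critical value and p-value on the t distribution with n-1 df.
t_crit = stats.t.isf(alpha, n - 1)
p_value = stats.t.sf(t_stat, n - 1)

# For H1: mean > mu0, these two rules always give the same decision.
reject_by_p = p_value < alpha
reject_by_crit = t_stat > t_crit
print('reject H0 (p-value rule)       :', reject_by_p)
print('reject H0 (critical-value rule):', reject_by_crit)
```

The two rules are equivalent because `stats.t.sf` is strictly decreasing, so `t_stat > t_crit` holds exactly when `p_value < alpha`.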

# 3.2. It is assumed that the candidates from the states of Uttar Pradesh and West Bengal account for approximately the same average number of total votes. The ADR wants to check whether this assumption can be refuted or not. Form the hypotheses and conduct the hypothesis test to check whether the assumption is true. What assumptions do you need to check before the test for equality of means is performed? (Use alpha = 0.05.)

### Solution:
H0: Avg. total votes for Uttar Pradesh candidates = Avg. total votes for West Bengal candidates

H1: Avg. total votes for Uttar Pradesh candidates != Avg. total votes for West Bengal candidates

To perform the two-sample t-test, the following assumptions must hold:
1. The variable must follow a continuous distribution.
2. Each sample must be randomly collected from its population, and the two samples must be independent of each other.
3. The underlying distribution must be normal. Alternatively, if the data is continuous but cannot be assumed to follow a normal distribution, a reasonably large sample size is required. By the CLT, the sample mean is approximately normally distributed even if the population distribution is not normal, and a sample size of at least 30 is the usual rule of thumb.
4. For the pooled 2-sample t-test (equal_var=True), the population variances of the 2 distributions must be equal.
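Assumptions 3 and 4 can be checked numerically before running the test. One common approach, sketched here on synthetic data, is the Shapiro-Wilk test for normality and Levene's test for equal variances. The arrays `group_a` and `group_b` are hypothetical stand-ins for the two states' vote counts, and the 0.05 threshold is an assumption carried over from the question.

```python
import numpy as np
import scipy.stats as stats

# Synthetic, hypothetical groups; replace with the real samples when applying this.
rng = np.random.default_rng(2)
group_a = rng.normal(loc=100.0, scale=15.0, size=60)
group_b = rng.normal(loc=100.0, scale=15.0, size=45)

# Shapiro-Wilk test. H0: the sample comes from a normal distribution.
_, shapiro_p_a = stats.shapiro(group_a)
_, shapiro_p_b = stats.shapiro(group_b)

# Levene's test. H0: the groups have equal variances.
_, levene_p = stats.levene(group_a, group_b)

# A large Levene p-value gives no evidence against equal variances,
# which is the situation where equal_var=True is defensible.
equal_var = bool(levene_p > 0.05)

print('Shapiro p-values:', shapiro_p_a, shapiro_p_b)
print('Levene p-value  :', levene_p)
print('use pooled t-test:', equal_var)
```

A small p-value in either check is evidence against the corresponding assumption. Failing to reject does not prove the assumption, so these checks are best read together with the sample sizes.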

In [9]:
up_data = np.array(df[df['STATE']=='Uttar Pradesh']['TOTAL VOTES'].dropna())
wb_data = np.array(df[df['STATE']=='West Bengal']['TOTAL VOTES'].dropna())
tstat,pvalue = stats.ttest_ind(up_data,wb_data,equal_var=True,nan_policy='omit')

In [10]:
pvalue

0.5864954528648343

As we can see, the p-value (0.586) is > 0.05. So, we do not have enough evidence to reject the null hypothesis. Hence, we cannot refute the assumption that the candidates from the states of Uttar Pradesh and West Bengal account for approximately the same average number of total votes.
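If the equal-variance assumption is doubtful, Welch's t-test (`equal_var=False` in `stats.ttest_ind`) is a common alternative that does not pool the variances. The sketch below compares the two variants on synthetic data. The arrays `sample_1` and `sample_2` are hypothetical and deliberately given different spreads, so the numbers say nothing about the actual states.

```python
import numpy as np
import scipy.stats as stats

# Synthetic, hypothetical samples with unequal spreads and sizes.
rng = np.random.default_rng(3)
sample_1 = rng.normal(loc=300.0, scale=50.0, size=80)
sample_2 = rng.normal(loc=300.0, scale=120.0, size=40)

# Pooled two-sample t-test: assumes equal population variances.
t_pooled, p_pooled = stats.ttest_ind(sample_1, sample_2, equal_var=True)

# Welch's t-test: no equal-variance assumption.
t_welch, p_welch = stats.ttest_ind(sample_1, sample_2, equal_var=False)

print('pooled: t =', t_pooled, 'p =', p_pooled)
print('Welch : t =', t_welch, 'p =', p_welch)
```

When the two variants lead to the same decision, the conclusion does not hinge on the equal-variance assumption.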