<a href="https://colab.research.google.com/github/spentaur/DS-Unit-1-Sprint-3-Statistical-Tests-and-Experiments/blob/master/module1-statistics-probability-and-inference/LS_DS_131_Statistics_Probability_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 1 Sprint 3 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
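
A minimal sketch of a two-sample t-test, assuming SciPy's `ttest_ind` and made-up 0/1 votes (the arrays are illustrative and not taken from the dataset). Passing `nan_policy='omit'` skips missing votes instead of letting them turn the result into NaN.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical votes on one issue: 1 = yes, 0 = no, NaN = no recorded vote.
dem_votes = np.array([1, 1, 1, 0, 1, np.nan, 1, 1])
rep_votes = np.array([0, 0, 1, 0, np.nan, 0, 0, 0])

# Two-sample (independent) t-test comparing the two group means.
# nan_policy='omit' ignores NaN entries rather than returning NaN.
result = ttest_ind(dem_votes, rep_votes, nan_policy='omit')
t_stat = float(result.statistic)
p_value = float(result.pvalue)

# A positive statistic means the first group's mean is the higher one.
print(f"t = {t_stat:.4f}, p = {p_value:.4g}")
```

The sign of the statistic gives the direction of the difference, and the two-sided p-value says how surprising a gap of that size would be if both groups had the same mean.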

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import altair as alt
import seaborn as sns
plt.style.use('fivethirtyeight')
%matplotlib inline
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel

In [0]:
def start():
    options = {
        'display': {
            'max_columns': None,
            'max_colwidth': 25,
            'expand_frame_repr': False,  # Don't wrap to multiple pages
            'max_rows': 100,
            'max_seq_items': 50,         # Max length of printed sequence
            'precision': 4,
            'show_dimensions': False
        },
        'mode': {
            'chained_assignment': None   # Controls SettingWithCopyWarning
        }
    }
    for category, option in options.items():
        for op, value in option.items():
            pd.set_option(f'{category}.{op}', value)  # Python 3.6+
start()

In [0]:
columns = """1. Class Name: 2 (democrat, republican)
2. handicapped-infants: 2 (y,n)
3. water-project-cost-sharing: 2 (y,n)
4. adoption-of-the-budget-resolution: 2 (y,n)
5. physician-fee-freeze: 2 (y,n)
6. el-salvador-aid: 2 (y,n)
7. religious-groups-in-schools: 2 (y,n)
8. anti-satellite-test-ban: 2 (y,n)
9. aid-to-nicaraguan-contras: 2 (y,n)
10. mx-missile: 2 (y,n)
11. immigration: 2 (y,n)
12. synfuels-corporation-cutback: 2 (y,n)
13. education-spending: 2 (y,n)
14. superfund-right-to-sue: 2 (y,n)
15. crime: 2 (y,n)
16. duty-free-exports: 2 (y,n)
17. export-administration-act-south-africa: 2 (y,n)"""

# sloppy and slow, but data is small and it's quicker than writing everything 
# out by hand
columns = columns.split("\n")
columns = [c.split(". ") for c in columns]
columns = [c[1].split(":") for c in columns]
columns = [c[0].replace('-', ' ') for c in columns]

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/\
voting-records/house-votes-84.data', names=columns, na_values="?",
                 true_values=['y'], false_values=['n'])

df.shape


(435, 17)


4. Relevant Information:
      This data set includes votes for each of the U.S. House of
      Representatives Congressmen on the 16 key votes identified by the
      CQA.  The CQA lists nine different types of votes: voted for, paired
      for, and announced for (these three simplified to yea), voted
      against, paired against, and announced against (these three
      simplified to nay), voted present, voted present to avoid conflict
      of interest, and did not vote or otherwise make a position known
      (these three simplified to an unknown disposition).

8. Missing Attribute Values: Denoted by "?"

NOTE: It is important to recognize that "?" in this database does not mean that the value of the attribute is unknown.  It means simply, that the value is not "yea" or "nay" (see "Relevant Information" section above).

   Attribute:  #Missing Values:
           1:  0
           2:  0
           3:  12
           4:  48
           5:  11
           6:  11
           7:  15
           8:  11
           9:  14
          10:  15
          11:  22
          12:  7
          13:  21
          14:  31
          15:  25
          16:  17
          17:  28


# ▼ None of this needed to be done if I had just read the docs

In [0]:
(df.isna().sum() / len(df)) * 100

Class Name                                 0.0000
handicapped infants                        2.7586
water project cost sharing                11.0345
adoption of the budget resolution          2.5287
physician fee freeze                       2.5287
el salvador aid                            3.4483
religious groups in schools                2.5287
anti satellite test ban                    3.2184
aid to nicaraguan contras                  3.4483
mx missile                                 5.0575
immigration                                1.6092
synfuels corporation cutback               4.8276
education spending                         7.1264
superfund right to sue                     5.7471
crime                                      3.9080
duty free exports                          6.4368
export administration act south africa    23.9080
dtype: float64

In [0]:
df.drop(columns=['water project cost sharing', 'export administration act south africa'], inplace=True)

In [0]:
df['duty free exports'].value_counts()

False    233
True     174
Name: duty free exports, dtype: int64

In [0]:
na_features = ((df.isna().sum() / len(df)) * 100)[1:].index

# copy so the assignment below changes an independent frame, not a view of df
no_na = df.dropna().copy()

no_na['Class Name'] = no_na['Class Name'].map({'democrat': 0, 'republican': 1}).astype(int)

no_na.shape

(312, 15)

312 rows remain after dropping every row that contains an NA.

In [0]:
from sklearn.linear_model import LogisticRegression

models = {}

# Fit one model per column, predicting that column from all the others.
# score() is accuracy on the same rows used for fitting (training accuracy).
for feat in no_na.columns:
    X = no_na.drop(feat, axis=1).astype(int)
    y = no_na[feat].astype(int)
    logit = LogisticRegression(solver='lbfgs').fit(X, y)
    print(feat, " - ", logit.score(X, y))
    models[feat] = logit

Class Name  -  0.9775641025641025
handicapped infants  -  0.7532051282051282
adoption of the budget resolution  -  0.8878205128205128
physician fee freeze  -  0.9615384615384616
el salvador aid  -  0.9743589743589743
religious groups in schools  -  0.8878205128205128
anti satellite test ban  -  0.8782051282051282
aid to nicaraguan contras  -  0.9326923076923077
mx missile  -  0.9134615384615384
immigration  -  0.5897435897435898
synfuels corporation cutback  -  0.7211538461538461
education spending  -  0.9102564102564102
superfund right to sue  -  0.8301282051282052
crime  -  0.8846153846153846
duty free exports  -  0.8141025641025641


Interesting that party (`Class Name`) can be predicted with a score of 0.977, while the immigration vote scores 0.59, which is basically guessing.
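
The scores above come from `logit.score(X, y)` on the same rows the models were fit on, so they may be optimistic. A hedged sketch of a held-out estimate follows, assuming scikit-learn's `cross_val_score`; the random 0/1 matrix is a synthetic stand-in for the vote data, and the target is a made-up one chosen to be learnable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned votes: 200 "members", 5 binary "issues".
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 5))
# Hypothetical target that simply copies the first issue, so it is learnable.
y = X[:, 0]

# 5-fold cross-validation: each score is accuracy on rows held out of fitting.
scores = cross_val_score(LogisticRegression(solver='lbfgs'), X, y, cv=5)
print(scores.mean())
```

On the real frame, the same call could take `no_na.drop(feat, axis=1)` and `no_na[feat]` in place of the synthetic `X` and `y`.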

In [0]:
copy = df.copy()
copy['Class Name'] = copy['Class Name'].map({'democrat': 0, 'republican': 1}).astype(int)

In [0]:
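for idx, row in copy.iterrows():
    empties = np.where(row.isna())[0]
    # only impute rows with exactly one missing vote, so every other
    # column is available as a predictor
    if len(empties) == 1:
        feat = row.index[empties[0]]
        X = row.drop(feat).astype(int)
        model = models[feat]
        predicted = model.predict(np.array(X).reshape(1, -1))[0]
        # write the prediction into the missing cell only;
        # copy.iloc[idx] = predicted would overwrite the whole row
        copy.loc[idx, feat] = predicted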

In [0]:
(copy.isna().sum() / len(copy)) * 100

Class Name                           0.0000
handicapped infants                  2.2989
adoption of the budget resolution    2.0690
physician fee freeze                 2.5287
el salvador aid                      2.7586
religious groups in schools          1.6092
anti satellite test ban              2.7586
aid to nicaraguan contras            2.2989
mx missile                           2.0690
immigration                          1.3793
synfuels corporation cutback         3.2184
education spending                   4.1379
superfund right to sue               3.6782
crime                                3.2184
duty free exports                    3.4483
dtype: float64

In [0]:
(df.isna().sum() / len(df)) * 100

Class Name                           0.0000
handicapped infants                  2.7586
adoption of the budget resolution    2.5287
physician fee freeze                 2.5287
el salvador aid                      3.4483
religious groups in schools          2.5287
anti satellite test ban              3.2184
aid to nicaraguan contras            3.4483
mx missile                           5.0575
immigration                          1.6092
synfuels corporation cutback         4.8276
education spending                   7.1264
superfund right to sue               5.7471
crime                                3.9080
duty free exports                    6.4368
dtype: float64

In [0]:
congress = copy.dropna()

So after reading the docs, I think the best way to handle the "?" values is to keep the original dataframe and conditionally drop only what I don't need for the test at hand, right? A "?" isn't a missing value; the features just aren't binary. There are three choices: yes, no, and neither (recorded as "?").
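
A sketch of that idea, under the assumption that "conditionally" means dropping the "?" rows for one issue at test time while leaving the frame itself untouched. The toy frame, the `issue_a` column, and the `votes_for` helper are all hypothetical.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Toy frame in the raw file's format: 'y', 'n', or '?' per issue (made-up rows).
raw = pd.DataFrame({
    'party': ['democrat'] * 6 + ['republican'] * 6,
    'issue_a': ['y', 'y', 'y', '?', 'y', 'n', 'n', 'n', '?', 'n', 'n', 'y'],
})

def votes_for(frame, issue, party):
    """Return 1/0 votes on `issue` for `party`, dropping '?' for this issue only."""
    votes = frame.loc[frame['party'] == party, issue]
    # '?' is not in the mapping, so it becomes NaN and is dropped here.
    return votes.map({'y': 1, 'n': 0}).dropna()

dem = votes_for(raw, 'issue_a', 'democrat')
rep = votes_for(raw, 'issue_a', 'republican')

# The original frame still has all 12 rows; only this test ignores the '?' rows.
t_stat, p_value = ttest_ind(dem, rep)
print(len(raw), len(dem), len(rep), round(p_value, 4))
```

Compared with a single `dropna()` over the whole frame, this keeps a member's votes on every issue where they did vote yes or no.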

In [0]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/\
voting-records/house-votes-84.data', names=columns)

# because ? doesn't mean missing, it means other
df.replace({'n': -1, "y": 1, "?": 0}, inplace=True)

df.head()

Unnamed: 0,Class Name,handicapped infants,water project cost sharing,adoption of the budget resolution,physician fee freeze,el salvador aid,religious groups in schools,anti satellite test ban,aid to nicaraguan contras,mx missile,immigration,synfuels corporation cutback,education spending,superfund right to sue,crime,duty free exports,export administration act south africa
0,republican,-1,1,-1,1,1,1,-1,-1,-1,1,0,1,1,1,-1,1
1,republican,-1,1,-1,1,1,1,-1,-1,-1,-1,-1,1,1,1,-1,0
2,democrat,0,1,1,0,1,1,-1,-1,-1,-1,1,-1,1,1,-1,-1
3,democrat,-1,1,1,-1,0,1,-1,-1,-1,-1,1,-1,1,-1,-1,1
4,democrat,1,1,1,-1,1,1,-1,-1,-1,-1,1,0,1,1,1,1


In [0]:
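grouped = df.groupby('Class Name')
for feat in df.columns[1:]:
    # value_counts() sorts each party's responses by frequency, so rows 0-2
    # are the democrats' counts and rows 3-5 the republicans' (this assumes
    # both parties gave all three responses on every issue)
    dem_counts = grouped[feat].value_counts()[:3]
    rep_counts = grouped[feat].value_counts()[3:]
    print(feat)
    # caution: position 1 is the second most common response, which is
    # not always "yes"
    print("reps yes ratio: ", (rep_counts.iloc[1] / rep_counts.sum()) * 100)
    print("dems yes ratio: ", (dem_counts.iloc[1] / dem_counts.sum()) * 100)
    print("\n")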

handicapped infants
reps yes ratio:  18.452380952380953
dems yes ratio:  38.20224719101123


water project cost sharing
reps yes ratio:  43.452380952380956
dems yes ratio:  44.569288389513105


adoption of the budget resolution
reps yes ratio:  13.095238095238097
dems yes ratio:  10.861423220973784


physician fee freeze
reps yes ratio:  1.7857142857142856
dems yes ratio:  5.2434456928838955


el salvador aid
reps yes ratio:  4.761904761904762
dems yes ratio:  20.59925093632959


religious groups in schools
reps yes ratio:  10.119047619047619
dems yes ratio:  46.06741573033708


anti satellite test ban
reps yes ratio:  23.214285714285715
dems yes ratio:  22.09737827715356


aid to nicaraguan contras
reps yes ratio:  14.285714285714285
dems yes ratio:  16.853932584269664


mx missile
reps yes ratio:  11.30952380952381
dems yes ratio:  22.47191011235955


immigration
reps yes ratio:  43.452380952380956
dems yes ratio:  46.441947565543074


synfuels corporation cutback
reps yes ratio:  12

In [0]:
rep = df.loc[df['Class Name'] == 'republican']
dem = df.loc[df['Class Name'] == 'democrat']

In [0]:
rep.shape, dem.shape

((168, 17), (267, 17))

In [0]:
rep.mean(numeric_only=True).sort_values(ascending=False)

physician fee freeze                      0.9583
crime                                     0.9226
el salvador aid                           0.8869
religious groups in schools               0.7857
education spending                        0.6845
superfund right to sue                    0.6786
export administration act south africa    0.2738
immigration                               0.1131
water project cost sharing                0.0119
anti satellite test ban                  -0.5000
handicapped infants                      -0.6131
aid to nicaraguan contras                -0.6488
synfuels corporation cutback             -0.6964
adoption of the budget resolution        -0.7143
mx missile                               -0.7560
duty free exports                        -0.7619
dtype: float64

In [0]:
dem.mean(numeric_only=True).sort_values()

physician fee freeze                     -0.8652
education spending                       -0.6629
el salvador aid                          -0.5431
superfund right to sue                   -0.3970
crime                                    -0.2884
immigration                              -0.0562
religious groups in schools              -0.0449
water project cost sharing                0.0037
synfuels corporation cutback              0.0112
handicapped infants                       0.2022
duty free exports                         0.2584
mx missile                                0.4794
anti satellite test ban                   0.5281
export administration act south africa    0.6030
aid to nicaraguan contras                 0.6479
adoption of the budget resolution         0.7566
dtype: float64

In [0]:
means = pd.DataFrame({'Rep': rep.mean(numeric_only=True), 'Dem': dem.mean(numeric_only=True)})
means

                                      Rep     Dem
handicapped infants               -0.6131  0.2022
water project cost sharing         0.0119  0.0037
adoption of the budget resolution -0.7143  0.7566
physician fee freeze               0.9583 -0.8652
el salvador aid                    0.8869 -0.5431
religious groups in schools        0.7857 -0.0449
anti satellite test ban           -0.5000  0.5281
aid to nicaraguan contras         -0.6488  0.6479
mx missile                        -0.7560  0.4794
immigration                        0.1131 -0.0562


In [0]:
reps_support = (means['Rep'] - means['Dem']).sort_values(ascending=False)[:3].index
reps_support

Index(['physician fee freeze', 'el salvador aid', 'education spending'], dtype='object')

In [0]:
dems_support = (means['Dem'] - means['Rep']).sort_values(ascending=False)[:3].index
dems_support

Index(['adoption of the budget resolution', 'aid to nicaraguan contras',
       'mx missile'],
      dtype='object')

In [0]:
modes = pd.concat([rep.mode(), dem.mode()])

### Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01



In [0]:
# dems support

feat_pvals = {}

for feat in dems_support:
    _, pvalue = ttest_ind(rep[feat], dem[feat])
    feat_pvals[feat] = format(pvalue, '.50f')

feat_pvals

{'adoption of the budget resolution': '0.00000000000000000000000000000000000000000000000000',
 'aid to nicaraguan contras': '0.00000000000000000000000000000000000000000000000000',
 'mx missile': '0.00000000000000000000000000000000000000000000004863'}
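
A fixed-point format with 50 decimals prints any p-value below roughly 1e-50 as all zeros, which hides how small it actually is. A small sketch of scientific-notation formatting, using the `mx missile` value shown above (about 4.863e-47) and a hypothetical smaller value:

```python
# p-value printed above for 'mx missile', written in scientific notation
p_mx = 4.863e-47
# hypothetical smaller p-value, below what 50 decimals can display
p_tiny = 1e-60

print(format(p_mx, '.50f'))    # fixed-point: long run of zeros, then 4863
print(format(p_tiny, '.50f'))  # fixed-point: every digit prints as 0
print(format(p_mx, '.3e'))     # scientific notation keeps the magnitude
print(format(p_tiny, '.3e'))
```

Either way, all of these are far below the 0.01 threshold; the `.3e` form just makes them comparable with each other.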

In [0]:
# reps support

feat_pvals = {}

for feat in reps_support:
    _, pvalue = ttest_ind(rep[feat], dem[feat])
    feat_pvals[feat] = format(pvalue, '.50f')

feat_pvals

{'education spending': '0.00000000000000000000000000000000000000000000000000',
 'el salvador aid': '0.00000000000000000000000000000000000000000000000000',
 'physician fee freeze': '0.00000000000000000000000000000000000000000000000000'}

In [0]:
# test them all?

feat_pvals = {}

for feat in df.columns[1:]:
    tstat, pvalue = ttest_ind(rep[feat], dem[feat])
    feat_pvals[feat] = format(pvalue, '.5f')

# p-values are fixed-width strings here, so sorting them as text matches numeric order
sorted_d = sorted(feat_pvals.items(), key=lambda x: x[1], reverse=True)

sorted_d

[('water project cost sharing', '0.93020'),
 ('immigration', '0.08345'),
 ('handicapped infants', '0.00000'),
 ('adoption of the budget resolution', '0.00000'),
 ('physician fee freeze', '0.00000'),
 ('el salvador aid', '0.00000'),
 ('religious groups in schools', '0.00000'),
 ('anti satellite test ban', '0.00000'),
 ('aid to nicaraguan contras', '0.00000'),
 ('mx missile', '0.00000'),
 ('synfuels corporation cutback', '0.00000'),
 ('education spending', '0.00000'),
 ('superfund right to sue', '0.00000'),
 ('crime', '0.00000'),
 ('duty free exports', '0.00000'),
 ('export administration act south africa', '0.00000')]
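
One possible refactor of the loop above into a reusable function, as a sketch only: `compare_parties` is a hypothetical helper, the toy frame and its column names are made up, and p-values are kept as floats so the sort is numeric.

```python
import pandas as pd
from scipy.stats import ttest_ind

def compare_parties(frame, party_col, issues):
    """Run a two-sample t-test per issue and return one row per issue."""
    rep = frame[frame[party_col] == 'republican']
    dem = frame[frame[party_col] == 'democrat']
    rows = []
    for issue in issues:
        t_stat, p_value = ttest_ind(rep[issue], dem[issue])
        rows.append({
            'issue': issue,
            'rep_mean': rep[issue].mean(),
            'dem_mean': dem[issue].mean(),
            't': t_stat,       # positive when republicans score higher
            'p': p_value,      # two-sided p-value, kept as a float
        })
    # Smallest p-values first.
    return pd.DataFrame(rows).sort_values('p').reset_index(drop=True)

# Made-up votes coded as in the notebook: 1 = yes, -1 = no, 0 = other.
toy = pd.DataFrame({
    'party': ['republican'] * 8 + ['democrat'] * 8,
    'issue_a': [1, 1, 1, 1, 1, 1, 0, 1, -1, -1, -1, 0, -1, -1, -1, 1],
    'issue_b': [1, -1, 1, -1, 0, 1, -1, 1, 1, -1, -1, 1, 0, 1, -1, 1],
})

results = compare_parties(toy, 'party', ['issue_a', 'issue_b'])
print(results)
```

Filtering `results` on `p < 0.01` together with the sign of `t` would pick out issues one party supports more, and `p > 0.1` would pick out issues with little evidence of a difference.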

In [0]:
# I want to test changing the y/n encoding
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/\
voting-records/house-votes-84.data', names=columns)

# because ? doesn't mean missing, it means other
df.replace({'n': 1, "y": 2, "?": 3}, inplace=True)

df.head()

Unnamed: 0,Class Name,handicapped infants,water project cost sharing,adoption of the budget resolution,physician fee freeze,el salvador aid,religious groups in schools,anti satellite test ban,aid to nicaraguan contras,mx missile,immigration,synfuels corporation cutback,education spending,superfund right to sue,crime,duty free exports,export administration act south africa
0,republican,1,2,1,2,2,2,1,1,1,2,3,2,2,2,1,2
1,republican,1,2,1,2,2,2,1,1,1,1,1,2,2,2,1,3
2,democrat,3,2,2,3,2,2,1,1,1,1,2,1,2,2,1,1
3,democrat,1,2,2,1,3,2,1,1,1,1,2,1,2,1,1,2
4,democrat,2,2,2,1,2,2,1,1,1,1,2,3,2,2,2,2


In [0]:
rep = df.loc[df['Class Name'] == 'republican']
dem = df.loc[df['Class Name'] == 'democrat']

In [0]:
means = pd.DataFrame({'Rep': rep.mean(numeric_only=True), 'Dem': dem.mean(numeric_only=True)})
means

                                      Rep     Dem
handicapped infants                1.2202  1.6517
water project cost sharing         1.6845  1.6592
adoption of the budget resolution  1.1786  1.9176
physician fee freeze               2.0060  1.1124
el salvador aid                    1.9702  1.2959
religious groups in schools        1.9107  1.5281
anti satellite test ban            1.3036  1.8090
aid to nicaraguan contras          1.2738  1.8464
mx missile                         1.1488  1.8464
immigration                        1.5833  1.4944


In [0]:
reps_support = (means['Rep'] - means['Dem']).sort_values(ascending=False)[:3].index
dems_support = (means['Dem'] - means['Rep']).sort_values(ascending=False)[:3].index

In [0]:
feat_pvals = {}

for feat in dems_support:
    _, pvalue = ttest_ind(rep[feat], dem[feat])
    feat_pvals[feat] = format(pvalue, '.50f')

feat_pvals

{'adoption of the budget resolution': '0.00000000000000000000000000000000000000000000000000',
 'aid to nicaraguan contras': '0.00000000000000000000000000001077703924527596059076',
 'mx missile': '0.00000000000000000000000000000000000000037496522204'}
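
These p-values differ from the earlier run because the new coding is not just a shift or rescale of the old one: "?" moved from the middle (0) to the top (3). A sketch on made-up votes, relying on the fact that the two-sample t statistic is unchanged by adding a constant or multiplying by a positive constant:

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up votes in the first coding: 1 = yes, -1 = no, 0 = other ('?').
rep = np.array([1] * 10 + [-1] * 3 + [0] * 2)
dem = np.array([1] * 4 + [-1] * 9 + [0] * 2)

def recode(votes):
    """Map the first coding to the second: no -> 1, yes -> 2, other -> 3."""
    mapping = {-1: 1, 1: 2, 0: 3}
    return np.array([mapping[v] for v in votes])

_, p_original = ttest_ind(rep, dem)
# Shift and rescale (here 2 * x + 5): same t statistic, same p-value.
_, p_affine = ttest_ind(2 * rep + 5, 2 * dem + 5)
# Reordering the categories changes the means and variances, so p changes.
_, p_recoded = ttest_ind(recode(rep), recode(dem))

print(p_original, p_affine, p_recoded)
```

So the choice of where "?" sits on the number line is part of the hypothesis being tested, not a cosmetic detail.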