### Statistical analysis and hypothesis testing ###

In this notebook we are going to:
1. Test some **hypotheses** about shark attacks, using variables from our dataset and different **statistical tests**
2. **Plot** some interesting variables and the results of the tests

As always, the first step is to **import** the libraries that we are going to use for the tasks described above. In this case we import libraries that provide **plotting** and **statistics** functions.

In [1]:
import pandas as pd
import numpy as np
import matplotlib as mp
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from scipy.stats import chisquare
from scipy.stats import chi2_contingency
pd.set_option('display.max_rows', None)

Then we load the dataset that we previously cleaned and that is now ready for analysis.

In [2]:
sharks_final_dataset = pd.read_csv("sharks_final_dataset.csv", index_col=0)
sharks_final_dataset.head()

Unnamed: 0,Month,Year,Type,Country,Area,Location,Activity,Name,Gender,Age,Injury,Severity,Hour,Shark specie,link source,Case number
0,May,2018.0,Unprovoked,USA,Florida,"Lighhouse Point Park, Ponce Inlet, Volusia County",Fishing,male,Male,52.0,Minor injury to foot. PROVOKED INCIDENT,No death,,Lemon shark,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.05.27
1,May,2018.0,Unprovoked,USA,South Carolina,"Hilton Head Island, Beaufort County",Swimming,Jei Turrell,Male,10.0,Severe bite to right forearm,No death,15h00,,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.05.13.b
2,Apr,2018.0,Provoked,AUSTRALIA,New South Wales,Lennox Head,Water sports,Matthew Lee,Male,,No injury,No death,07h00,Questionable shark involvement,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.04.25.b
3,Apr,2018.0,Invalid,BRAZIL,Alagoas,"Praia de Sauaçuhy, Maceió",Fishing,Josias Paz,Male,56.0,Injury to ankle from marine animal trapped in ...,No death,,Questionable shark involvement,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.04.10.R
4,Apr,2018.0,Unprovoked,NEW CALEDONIA,,"Magenta Beach, Noumea",Water sports,,Unknown,,"No injury, shark bit board",No death,17h00,,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.04.09


### Chi-square test between activity and severity
In this first test we check whether there is a relation between the activity practiced at the moment of the attack and the severity of the attack. 
Our **null hypothesis** states that severity is independent of the activity, while the **alternative hypothesis** states that there is a relation between the activity practiced and the severity of the attack.


In [3]:
stats.chi2_contingency(pd.crosstab(sharks_final_dataset["Activity"], sharks_final_dataset["Severity"]))

(670.2144592196706,
 1.4790083177401616e-127,
 22,
 array([[1.40089786e+02, 4.38736732e+02, 4.91734818e+01],
        [8.92291630e-01, 2.79450148e+00, 3.13206891e-01],
        [1.11536454e+00, 3.49312685e+00, 3.91508613e-01],
        [2.27534366e+01, 7.12597877e+01, 7.98677571e+00],
        [1.29382286e+02, 4.05202714e+02, 4.54149991e+01],
        [2.59656864e+02, 8.13199930e+02, 9.11432052e+01],
        [8.92291630e-01, 2.79450148e+00, 3.13206891e-01],
        [1.33843745e+00, 4.19175222e+00, 4.69810336e-01],
        [3.52455194e+01, 1.10382808e+02, 1.23716722e+01],
        [2.36234209e+02, 7.39844267e+02, 8.29215243e+01],
        [3.97069776e+02, 1.24355316e+03, 1.39377066e+02],
        [5.73297373e+01, 1.79546720e+02, 2.01235427e+01]]))

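As the **p-value** (about 1.5e-127) is far below the **significance level** of **0.05**, we **reject** the null hypothesis of independence. The data therefore support an association between the activity practiced at the moment of the attack and the severity of the attack. Note that several expected counts in the output above are below 5, so the chi-square approximation may be rough for the rarest activities.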
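The decision rule above can be sketched in code. This is a minimal illustration on made-up counts, not on the shark dataset: the `toy` frame, its values, and the `alpha` and `reject_null` names are assumptions introduced here. It unpacks the four values returned by `chi2_contingency` instead of reading them from the raw tuple.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data, NOT taken from the shark dataset
toy = pd.DataFrame({
    "Activity": ["Swimming"] * 40 + ["Fishing"] * 40,
    "Severity": ["Death"] * 25 + ["No death"] * 15
                + ["Death"] * 5 + ["No death"] * 35,
})

# Observed counts per (activity, severity) pair
observed = pd.crosstab(toy["Activity"], toy["Severity"])

# Test statistic, p-value, degrees of freedom and expected counts
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

alpha = 0.05  # significance level (assumed, matching the 0.05 used in the text)
reject_null = p_value < alpha
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.3g}, reject H0: {reject_null}")
```

Naming the four returned values makes it explicit which number is compared with the significance level.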

Then we build a table with the **expected** frequencies computed by the chi-square test (the counts we would see if activity and severity were independent) and compare it with the **contingency** table of *observed* counts in our dataset. As the p-value indicated, there is a big difference between the *expected* values and the observed ones.

In [4]:
pd.DataFrame(stats.chi2_contingency(pd.crosstab(sharks_final_dataset["Activity"], sharks_final_dataset["Severity"]))[-1])

Unnamed: 0,0,1,2
0,140.089786,438.736732,49.173482
1,0.892292,2.794501,0.313207
2,1.115365,3.493127,0.391509
3,22.753437,71.259788,7.986776
4,129.382286,405.202714,45.414999
5,259.656864,813.19993,91.143205
6,0.892292,2.794501,0.313207
7,1.338437,4.191752,0.46981
8,35.245519,110.382808,12.371672
9,236.234209,739.844267,82.921524


In [5]:
pd.crosstab(sharks_final_dataset["Activity"], sharks_final_dataset["Severity"])

Severity,Death,No death,Unknown
Activity,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Bathing/Floating/Walking near the shore,153,435,40
Clamming,0,3,1
Dangling feet in the water,1,4,0
Direct interaction with a shark,8,92,2
Diving,126,391,63
Fishing,162,931,71
Lifesaving drill,2,2,0
Murder,4,0,2
Sea or air disaster,100,30,28
Swimming,395,555,109
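The expected-frequency table above loses its row and column labels because `chi2_contingency` returns a plain NumPy array. A possible way to make the comparison easier to read is to re-attach the labels of the observed table and subtract. The sketch below uses hypothetical counts, and the `expected_df` and `difference` names are introduced here for illustration only.

```python
import pandas as pd
from scipy import stats

# Hypothetical counts for illustration only (not the notebook's real table)
observed = pd.DataFrame(
    {"Death": [30, 10], "No death": [20, 40]},
    index=pd.Index(["Swimming", "Fishing"], name="Activity"),
)

# Only the expected counts are needed here
_, _, _, expected = stats.chi2_contingency(observed)

# Re-attach the labels that the raw NumPy array does not carry
expected_df = pd.DataFrame(expected, index=observed.index, columns=observed.columns)

# Positive values: more cases observed than independence would predict
difference = observed - expected_df
print(difference.round(2))
```

With the real data, the same pattern would apply to the `pd.crosstab` result used in the cells above.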


### Two-sample t-test (between age and severity)

The second test we apply is a t-test between a categorical variable (severity) and a numerical one (the age of the victim). The **null hypothesis** is that the mean age is the same for fatal and non-fatal attacks.


In [None]:
a = sharks_final_dataset[sharks_final_dataset["Severity"] == "No death"]["Age"]
b = sharks_final_dataset[sharks_final_dataset["Severity"] == "Death"]["Age"]

In [12]:
stats.ttest_ind(a.dropna(), b.dropna())

Ttest_indResult(statistic=0.30528897030694535, pvalue=0.7601658633059136)

As the **p-value** (0.76) is greater than our significance level of 0.05, we cannot reject the null hypothesis: we find no evidence that **the mean age of the victims differs between fatal and non-fatal incidents**.
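By default, `stats.ttest_ind` assumes that both groups have equal variances. When the groups differ in size and spread, Welch's variant (`equal_var=False`) is a common alternative. The sketch below compares the two on simulated ages. The group sizes, means and standard deviations are assumptions chosen for illustration and are not results from the shark dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated, hypothetical ages: same mean, different spread and group size
ages_no_death = rng.normal(loc=27, scale=12, size=200)
ages_death = rng.normal(loc=27, scale=18, size=60)

# Student's t-test (equal variances assumed, the default)
student = stats.ttest_ind(ages_no_death, ages_death)

# Welch's t-test (no equal-variance assumption)
welch = stats.ttest_ind(ages_no_death, ages_death, equal_var=False)

print(f"Student: t={student.statistic:.3f}, p={student.pvalue:.3f}")
print(f"Welch:   t={welch.statistic:.3f}, p={welch.pvalue:.3f}")
```

If both variants lead to the same decision, the conclusion does not depend on the equal-variance assumption.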

In [13]:
# ANOVA test to check age and country