# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as st

# Challenge 1 - The `stats` Submodule

This submodule contains statistical functions for conducting hypothesis tests, working with probability distributions, and other useful tools. Let's examine this submodule using the KickStarter dataset. Load the data from the `ks-projects-201801.csv` file.

In [None]:
ks = pd.read_csv("/content/ks-projects-201801.csv")


Now use the `head` function to examine the dataset.

In [None]:
ks.head(1)

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95


Import the `mode` function from `scipy.stats` and find the mode of the `country` and `currency` columns.

In [None]:
from scipy.stats import mode  # imported as the exercise asks; pandas is used below because these columns hold strings

# For country
country_mode = ks["country"].mode()[0]               # most frequent value
count_country = ks["country"].value_counts().iloc[0]  # its number of observations
print("Most frequent country is", country_mode, "\nwith", count_country, "observations")

# For currency
currency_mode = ks["currency"].mode()[0]
count_currency = ks["currency"].value_counts().iloc[0]
print("Most frequent currency is", currency_mode, "\nwith", count_currency, "observations")

The trimmed mean is the mean of the data after some observations have been removed. The most common way to compute it is to specify a percentage and remove that share of elements from both ends. Alternatively, we can specify a threshold on each end and keep only the values that fall between them. The goal is a more robust estimate of the mean that is less influenced by outliers. SciPy contains a function called `tmean` for computing the trimmed mean; it takes the threshold approach and averages only the values inside the given `limits`.
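To make the threshold idea concrete, here is a minimal plain-Python sketch of a trimmed mean with lower and upper limits. The helper name `trimmed_mean` and the sample values are hypothetical and only for illustration; this is not SciPy's implementation.

```python
from statistics import mean


def trimmed_mean(values, lower, upper, inclusive=(True, True)):
    """Mean of the values that fall between lower and upper."""
    def in_range(v):
        above = v >= lower if inclusive[0] else v > lower
        below = v <= upper if inclusive[1] else v < upper
        return above and below

    kept = [v for v in values if in_range(v)]
    if not kept:
        raise ValueError("no values fall within the given limits")
    return mean(kept)


# Hypothetical funding goals with one extreme outlier
goals = [500, 1000, 1500, 2000, 1_000_000]

plain = mean(goals)                     # pulled far upwards by the outlier
trimmed = trimmed_mean(goals, 0, 2000)  # ignores everything above 2000
print(plain, trimmed)
```

The plain mean is dominated by the single large goal, while the trimmed mean reflects the typical values.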

In the cell below, import the `tmean` function and then find the 75th percentile of the `goal` column. Compute the trimmed mean between 0 and the 75th percentile of the column. Read more about the `tmean` function [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.tmean.html#scipy.stats.tmean).

In [None]:
ks.head(1)

In [None]:
from scipy.stats import tmean
# Signature: tmean(a, limits=None, inclusive=(True, True), axis=None)
# `limits` is the (lower, upper) range of values to keep, here 0 to the 75th percentile.

Q3 = np.quantile(ks["goal"], 0.75)

# Keep goals in (0, Q3]: the lower bound is excluded and the upper bound included,
# which matches filtering the rows with (goal > 0) & (goal <= Q3) by hand.
q3_tmean = tmean(ks["goal"], limits=(0, Q3), inclusive=(False, True))
q3_tmean

4874.150287106898

In [None]:
""" Just did this to understand and be sure about previous exercise. Using q4 = total sample, maybe helped to see if I was wrong """
q4_mean = total_mean = ks["goal"].mean()
print(q4_mean)

q4_tmean = tmean(ks["goal"])
print(q4_tmean)

#### SciPy contains various statistical tests. One of the tests is Fisher's exact test. This test is used for contingency tables.

The test originates from the "Lady Tasting Tea" experiment. In 1935, Fisher published an account of the experiment in his book. The experiment was based on a claim by Muriel Bristol that she could taste whether the tea or the milk had been poured into the cup first. Fisher devised this test to evaluate her claim. The null hypothesis is that the treatment does not affect the outcome, while the alternative hypothesis is that it does. To read more about Fisher's exact test, see:

* [Wikipedia's explanation](http://b.link/test61)
* [A cool deep explanation](http://b.link/handbook47)
* [An explanation with some important Fisher's considerations](http://b.link/significance76)
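As a worked sketch of the idea behind the test, assume the classic design of eight cups, four with milk poured first and four with tea poured first, where the taster must pick the four milk-first cups. The cup counts are an assumption here, since the text above does not state them. Under the null hypothesis of pure guessing, the number of correct picks follows a hypergeometric distribution, so the probabilities can be counted exactly:

```python
from fractions import Fraction
from math import comb


def prob_correct(k, n_each=4):
    """P(exactly k milk-first cups are picked correctly) under pure guessing."""
    ways_k_correct = comb(n_each, k) * comb(n_each, n_each - k)
    all_ways = comb(2 * n_each, n_each)
    return Fraction(ways_k_correct, all_ways)


# One-sided p-values: probability of doing at least this well by chance
p_all_correct = prob_correct(4)
p_at_least_three = prob_correct(3) + prob_correct(4)
print(p_all_correct, p_at_least_three)
```

Only a perfect score (1 in 70) falls below the usual 0.05 threshold in this design; getting three of four right (17 in 70) does not.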

Let's perform Fisher's exact test on our KickStarter data. We intend to test the hypothesis that the choice of currency has an impact on meeting the pledge goal. We'll start by creating two derived columns in our dataframe. The first will contain 1 if the amount of money in `usd_pledged_real` is greater than the amount of money in `usd_goal_real`. We can compute this by using the `np.where` function. If the amount in one column is greater than the other, enter a value of 1, otherwise enter a value of zero. Add this column to the dataframe and name it `goal_met`.

In [None]:
ks.head(1)

In [None]:
# Use Fisher's exact test of independence when you have two nominal variables and you want to see
# whether the proportions of one variable differ depending on the value of the other variable.
# It is typically used when the sample size is small.

# H0: the treatment does not affect the outcome
# H1: the treatment does affect the outcome

In [None]:
# goal_met is 1 if the amount in usd_pledged_real is greater than the amount in usd_goal_real, else 0.
# np.where(condition, x, y) returns x where the condition holds and y elsewhere.

ks["goal_met"] = np.where(ks["usd_pledged_real"] > ks["usd_goal_real"], 1, 0)

Next, create a column that checks whether the currency of the project is in US Dollars. Create a column called `usd` using the `np.where` function where if the currency is US Dollars, assign a value of 1 to the row and 0 otherwise.

In [None]:
ks["usd"] = np.where(ks["currency"] == "USD", 1, 0)

Now create a contingency table using the `pd.crosstab` function in the cell below to compare the `goal_met` and `usd` columns.

Import the `fisher_exact` function from `scipy.stats` and conduct the hypothesis test on the contingency table that you have generated above. You can read more about the `fisher_exact` function [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html#scipy.stats.fisher_exact). The output of the function should be the odds ratio and the p-value. The p-value will provide you with the outcome of the test.
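To show what the contingency table and the odds ratio represent, here is a small plain-Python sketch on made-up data. The `(goal_met, usd)` pairs are hypothetical, and the cross product `(a * d) / (b * c)` is the sample odds ratio, which to my understanding is what `fisher_exact` reports as its statistic.

```python
from collections import Counter

# Hypothetical (goal_met, usd) pairs, one per project
rows = [(1, 1), (1, 1), (1, 1), (0, 1), (0, 1),
        (1, 0), (0, 0), (0, 0), (0, 0), (0, 0)]
counts = Counter(rows)

# 2x2 table: rows are goal_met = 0, 1; columns are usd = 0, 1
table = [[counts[(g, u)] for u in (0, 1)] for g in (0, 1)]

# Sample odds ratio for a table [[a, b], [c, d]]
(a, b), (c, d) = table
odds_ratio = (a * d) / (b * c)
print(table, odds_ratio)
```

An odds ratio above 1 means that, in this toy data, the odds of meeting the goal are higher for USD projects than for the others.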

In [None]:
from scipy.stats import fisher_exact

# From: 04_lab_hypothesis_testing_1
# Contingency table: rows are goal_met (0/1), columns are usd (0/1)
cont_table = pd.crosstab(ks["goal_met"], ks["usd"])

# Syntax: scipy.stats.fisher_exact(table, alternative='two-sided')
fisher_exact(cont_table, alternative="two-sided")

SignificanceResult(statistic=1.3791838163150314, pvalue=2.994e-320)

In [None]:
# H0: whether the currency is USD is not associated with meeting the goal
# H1: whether the currency is USD is associated with meeting the goal

p_value = st.fisher_exact(cont_table, alternative="two-sided")[1]

if p_value > 0.05:
    print("We cannot reject the null hypothesis")
else:
    print("We can reject the null hypothesis")

# The p-value is vanishingly small (about 3e-320), so we reject H0: goal_met and usd are associated.
# This shows an association, not that using USD causes a project to meet its goal. With a sample
# this large, even a modest effect (odds ratio of about 1.38) produces a tiny p-value.

# Challenge 2 - The `interpolate` submodule

This submodule allows us to interpolate between observed data points and so build a continuous function from the observed data.

In the cell below, import the `interp1d` function from `scipy.interpolate` and first take a sample of 10 rows from the KickStarter dataframe.

**Make sure there are no duplicated values in `backers`.**

In [None]:
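from scipy.interpolate import interp1d

# Make sure there are no duplicated values in backers.
# Drop duplicate rows before sampling: assigning the de-duplicated column back
# to ks would realign on the index and fill the dropped rows with NaN.
sample_10 = ks.drop_duplicates(subset="backers").sample(10)
sample_10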

Next, create a linear interpolation of `usd_pledged_real` as a function of the `backers`. Create a function `f` that generates a linear interpolation of `usd_pledged_real` as predicted by the amount of `backers`.
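To illustrate what a linear interpolation function does, here is a plain-Python sketch that joins neighbouring points with straight lines and evaluates the result on every integer in the observed range. The helper `make_linear_interpolator` and the sample values are hypothetical; this is not how `interp1d` is implemented, only the same idea on a small scale.

```python
from bisect import bisect_right


def make_linear_interpolator(xs, ys):
    """Return f(x) that interpolates linearly between the points (xs, ys)."""
    pts = sorted(zip(xs, ys))
    px = [p[0] for p in pts]
    py = [p[1] for p in pts]

    def f(x):
        if not px[0] <= x <= px[-1]:
            raise ValueError("x is outside the interpolation range")
        # Index of the segment [px[i], px[i + 1]] that contains x
        i = min(bisect_right(px, x) - 1, len(px) - 2)
        t = (x - px[i]) / (px[i + 1] - px[i])
        return py[i] + t * (py[i + 1] - py[i])

    return f


# Hypothetical sample: number of backers -> amount pledged
backers = [2, 10, 5]
pledged = [40.0, 400.0, 100.0]

f = make_linear_interpolator(backers, pledged)
x_new = range(min(backers), max(backers) + 1)
filled = [f(x) for x in x_new]
print(filled)
```

The sketch divides by the gap between neighbouring x values, so two points sharing the same x value would cause a division by zero. That is one reason to require unique `backers` values.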

In [None]:
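# Linear interpolation (the interp1d default) of usd_pledged_real as a function of backers
f = interp1d(sample_10["backers"], sample_10["usd_pledged_real"])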

Now create a new variable called `x_new`. This variable will contain all integers between the minimum number of backers in our sample and the maximum number of backers. The goal here is to take the dataset, which contains few observations due to sampling, and use the interpolation function to fill in a value for every number of backers in that range.

Hint: one option is the `np.arange` function.

Plot function `f` for all values of `x_new`. Run the code below.

In [None]:
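# All integers from the minimum to the maximum number of backers in the sample
x_new = np.arange(sample_10["backers"].min(), sample_10["backers"].max() + 1)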

%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(x_new, f(x_new))

Next create a function that will generate a cubic interpolation function. Name the function `g`.

In [None]:
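# Cubic interpolation of usd_pledged_real as a function of backers
g = interp1d(sample_10["backers"], sample_10["usd_pledged_real"], kind="cubic")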



In [None]:
# Plot the cubic interpolation g for all values of x_new

plt.plot(x_new, g(x_new))

# Bonus Challenge - The Binomial Distribution

The binomial distribution allows us to calculate the probability of k successes in n trials for a random variable with two possible outcomes (which we typically label success and failure).  

The probability of success is typically denoted by p and the probability of failure is denoted by 1-p.

The `scipy.stats` submodule contains a `binom` object for computing the probabilities of a random variable with the binomial distribution. You may read more about the binomial distribution [here](http://b.link/binomial55).
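The probability of exactly k successes in n trials is C(n, k) · p^k · (1 − p)^(n − k). As a minimal sketch using only the standard library, with a hypothetical helper `binom_pmf` and a fair-coin example chosen for illustration:

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of exactly 2 heads in 4 tosses of a fair coin
p_two_heads = binom_pmf(2, 4, 0.5)
print(p_two_heads)

# The probabilities over all possible k add up to 1
total = sum(binom_pmf(k, 4, 0.5) for k in range(5))
```

Changing `k`, `n` and `p` covers other cases, such as a die where the probability of success on each roll is 1/6.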

* In the cell below, compute the probability that a die lands on 5 exactly 3 times in 8 tries.


In [None]:
# Your code here:



* Run a simulation of this event: write a function that simulates 8 tries and returns 1 if the result is 5 exactly 3 times and 0 otherwise. Now launch your simulation.
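One possible way to sketch such a simulation uses only the standard library. The function name, the seed and the number of runs are arbitrary choices for illustration. With many runs, the share of successes should settle near the binomial probability C(8, 3) · (1/6)^3 · (5/6)^5, which is roughly 0.104.

```python
import random


def three_fives_in_eight(rng):
    """Roll a die 8 times; return 1 if exactly 3 rolls show a 5, else 0."""
    rolls = [rng.randint(1, 6) for _ in range(8)]
    return 1 if rolls.count(5) == 3 else 0


rng = random.Random(42)  # fixed seed (arbitrary) so the run is repeatable
n_runs = 20_000
hits = sum(three_fives_in_eight(rng) for _ in range(n_runs))
estimate = hits / n_runs
print(estimate)
```

With only a handful of runs the estimate jumps around a lot; it stabilises as the number of runs grows.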

In [None]:
# Your code here:


* Launch 10 simulations and represent the result in a bar plot. Now launch 1000 simulations and represent it. What do you see?

In [None]:
# Your code here:
