<a href="https://colab.research.google.com/github/seismosmsr/machine_learning/blob/main/kaggle_api_in_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installing the [Kaggle API](https://github.com/Kaggle/kaggle-api) in Colab

Just to ensure we've got our requirements met. This also works if you choose to run things on the Kaggle back end (the code for pushing the notebook to auto-submit from Kaggle is at the bottom of this script).

In [7]:
!pip install kaggle
# conda is not available on a default Colab runtime, so install gdown with pip.
!pip install gdown


# Authenticating with Kaggle using kaggle.json

Navigate to https://www.kaggle.com. Then go to the [Account tab of your user profile](https://www.kaggle.com/me/account) and select Create API Token. This will trigger the download of kaggle.json, a file containing your API credentials.

Then run the cell below to fetch kaggle.json into your Colab runtime.

In [8]:
# from google.colab import files
import gdown
# This is my personal kaggle.json. If you run this and don't switch it out, you'll be running as me.
!gdown --id 1sD1x-nf2nXNNDFD3zdPKvOKM2wGSWcSP
  
# Then move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

Downloading...
From: https://drive.google.com/uc?id=1sD1x-nf2nXNNDFD3zdPKvOKM2wGSWcSP
To: /content/kaggle.json
  0% 0.00/69.0 [00:00<?, ?B/s]100% 69.0/69.0 [00:00<00:00, 97.9kB/s]
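
Before calling the API it can help to confirm the credentials file landed where expected. The following is a minimal sketch, assuming kaggle.json holds a `username` and a `key` field; the helper name `check_kaggle_credentials` is made up for illustration and is not part of the Kaggle API. It never prints the key itself.

```python
import json
import os
import stat

def check_kaggle_credentials(path="~/.kaggle/kaggle.json"):
    """Return a list of problems found with the credentials file (empty if none)."""
    path = os.path.expanduser(path)
    if not os.path.isfile(path):
        return [f"{path} does not exist"]
    problems = []
    with open(path) as fh:
        creds = json.load(fh)
    # The file is expected to carry both fields; report any that are absent or empty.
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append(f"missing or empty field: {field}")
    # chmod 600 means no permission bits for group or others.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append(f"file mode is {oct(mode)}; expected 0o600")
    return problems

print(check_kaggle_credentials())
```

An empty list means the file exists, has both fields, and is readable only by the current user.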


# Using the Kaggle API

For a more complete list of what you can do with the API, visit https://github.com/Kaggle/kaggle-api.

## Downloading a dataset

In [11]:
!kaggle competitions download -c covid-19-risk-2022

train_small.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
example_submission.ipynb: Skipping, found more recently modified local copy (use --force to force download)
User cancelled operation


Unzip the data and take a first glance.

In [12]:
!unzip train.csv.zip

!unzip test.csv.zip

!unzip train_small.csv.zip

Archive:  train.csv.zip
  inflating: train.csv               
Archive:  test.csv.zip
  inflating: test.csv                
Archive:  train_small.csv.zip
  inflating: train_small.csv         


In [13]:
import pandas as pd 

train_small = pd.read_csv('/content/train_small.csv')
# train = pd.read_csv('/content/train.csv')
# test = pd.read_csv('/content/test.csv')
train_small.head()



Unnamed: 0,case_month,res_state,state_fips_code,res_county,county_fips_code,age_group,sex,race,ethnicity,case_positive_specimen_interval,case_onset_interval,process,exposure_yn,labconfirmed_yn,symptomatic_yn,hosp_yn,icu_yn,death_yn,underlying_conditions_yn
0,2021-09,NY,36.0,BRONX,36005.0,0 - 17 years,Male,,,0.0,,Missing,,1,0.0,,,,
1,2021-09,CA,6.0,SAN JOAQUIN,6077.0,18 to 49 years,Male,,,,,Missing,,1,,,,,
2,2021-09,MA,25.0,MIDDLESEX,25017.0,0 - 17 years,Female,Missing,Unknown,,,Missing,,1,,,,,
3,2021-09,PA,42.0,ERIE,42049.0,65+ years,Male,White,Non-Hispanic/Latino,0.0,0.0,Missing,1.0,1,1.0,0.0,0.0,0.0,
4,2021-09,CA,6.0,KERN,6029.0,18 to 49 years,Male,Unknown,Unknown,,,Missing,,1,,,,,


So we've got case_month, res_state, state_fips_code, res_county, county_fips_code, age_group, sex, race, ethnicity, and lots more. One issue I can think of right away, and which we've gone over in class, is the completeness of the data. The approaches I've taken to this issue in the past are usually types of imputation. My initial thinking was to just randomly replace missing data with valid data. Another option would be to use clustering or a random forest to create 'informed' imputed values. A third option would be to not impute at all: group the 'types' of missing data together, then maximise the training set available within each group.

Either way, the method we use to deal with NAs is going to be important. Pure random imputation *may* keep us unbiased (this makes some assumptions about the underlying distribution of the missing data), but we definitely stand a chance of losing information if we can't somehow randomize the way the data is missing. Rather than randomly imputing the data only once, we could do it many times. Doing so would let us reuse some of our training data. We might think of this as similar to 'fuzzing' our data in deep learning (basic data augmentation).
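
The repeated-imputation idea can be sketched on a toy frame. This is only an illustration under stated assumptions: the helper names `random_impute` and `repeated_impute`, the `copy_id` column, and the toy values are all made up here, and each column is filled independently from its own observed values.

```python
import numpy as np
import pandas as pd

def random_impute(df, rng):
    """Fill each column's NaNs with random draws from that column's observed values."""
    out = df.copy()
    for col in out.columns:
        mask = out[col].isnull()
        observed = out.loc[~mask, col].to_numpy()
        if mask.any() and len(observed) > 0:
            out.loc[mask, col] = rng.choice(observed, size=int(mask.sum()))
    return out

def repeated_impute(df, n_copies, seed=0):
    """Stack n_copies independently imputed versions of df, tagged with copy_id."""
    rng = np.random.default_rng(seed)
    copies = [random_impute(df, rng).assign(copy_id=i) for i in range(n_copies)]
    return pd.concat(copies, ignore_index=True)

toy = pd.DataFrame({
    "sex": ["Male", None, "Female", "Female"],
    "hosp_yn": [1.0, 0.0, np.nan, np.nan],
})
augmented = repeated_impute(toy, n_copies=3)
print(augmented)
```

Note that the stacked copies share all of their observed values, so they are not independent rows. Any train/validation split would probably need to happen before this step rather than after it.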

If we use a systematic approach (such as training an algorithm to impute for us), one issue we'll face is that it could bias our new 'imputed' training dataset. What I mean by this is that our NAs may not be randomly distributed throughout the training dataset. Some hospitals or agencies may collect some data and not others. Maybe some states are better or worse in their quality of reporting. If we rely on an algorithmic approach, we'll effectively be baking whatever biases exist in the meta-context of our data into the data itself. We don't necessarily want to do that.
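
One way to check whether NAs are unevenly distributed is to compare the missing rate of a column across the values of another column, such as state. A minimal sketch, with a made-up helper name (`missing_rate_by_group`) and made-up toy values:

```python
import numpy as np
import pandas as pd

def missing_rate_by_group(df, group_col, target_col):
    """Fraction of rows with target_col missing, per value of group_col."""
    return df[target_col].isnull().groupby(df[group_col]).mean()

toy = pd.DataFrame({
    "res_state": ["NY", "NY", "CA", "CA", "CA", "PA"],
    "hosp_yn":   [1.0, np.nan, np.nan, np.nan, 0.0, 0.0],
})
rates = missing_rate_by_group(toy, "res_state", "hosp_yn")
print(rates)
```

Large differences between groups would suggest the data are not missing at random with respect to that grouping column.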

Finally, we could find a way to 'ignore' NAs while losing as little data as possible. In this approach, we look at where data is and isn't present, and see whether we can find 'groups' of rows that are missing some columns but not others. We could try doing this in a column-wise, row-wise, or column-by-row manner. Basically, we use the pattern of missing data to find meta-groups where most of the data is intact. Whichever approach we take, we would end up with different column/row combinations of data and would train multiple models on the back end that can then be combined into an ensemble model.
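
Finding such groups can start with a simple count of the distinct missingness patterns. The sketch below uses a made-up toy frame and a hypothetical helper name (`missing_patterns`); it relies on `DataFrame.value_counts`, which counts unique rows.

```python
import numpy as np
import pandas as pd

def missing_patterns(df):
    """Count rows per distinct combination of missing (True) and present (False) columns."""
    return df.isnull().value_counts()

toy = pd.DataFrame({
    "sex":     ["Male", None, None, "Female", None, "Male"],
    "race":    ["White", None, None, "Black", None, "White"],
    "hosp_yn": [1.0, 0.0, np.nan, np.nan, 1.0, 0.0],
})
patterns = missing_patterns(toy)
print(patterns)
```

Each index entry is one pattern, for example `(True, True, False)` for rows missing `sex` and `race` but not `hosp_yn`. Patterns with many rows are candidates for their own model.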

I'm planning on focusing primarily on the data clean-up and filtering to get improvements on performance in this exercise.

Read in the data

In [65]:
train = pd.read_csv('/content/train.csv')
test = pd.read_csv('/content/test.csv')



So let's take a look at our data.

In [41]:
train.isnull().describe()

Unnamed: 0,case_month,res_state,state_fips_code,res_county,county_fips_code,age_group,sex,race,ethnicity,case_positive_specimen_interval,case_onset_interval,process,exposure_yn,labconfirmed_yn,symptomatic_yn,hosp_yn,icu_yn,death_yn,underlying_conditions_yn
count,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855
unique,1,2,2,2,2,2,2,2,2,2,2,1,2,1,2,2,2,2,2
top,False,False,False,False,False,False,False,False,False,True,True,False,True,False,True,True,True,True,True
freq,36225855,36224934,36224934,33728416,33728416,35828300,35136242,30504585,29486359,23875686,18413918,36225855,33650024,36225855,18782425,19133076,34299032,22830993,34127479
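
For reading the degree of missingness directly, a per-column fraction can be easier to scan than the `describe()` summary above. A minimal sketch on a made-up frame (the values are illustrative and not taken from the competition data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "case_month": ["2020-01", "2020-02", "2020-03", "2020-04"],
    "sex": ["Male", None, "Female", None],
    "icu_yn": [np.nan, np.nan, np.nan, 0.0],
})

# isnull() gives a boolean frame; the mean of booleans is the fraction that are True.
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
print(missing_fraction)
```

The same expression applied to `train` would list the columns from most to least incomplete.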


Even in the first row we can see what the degree of missing data looks like: it is missing information on age, sex, race, and ethnicity. Not training on these data would be less than desirable.

In [39]:
train.iloc[:1]

Unnamed: 0,case_month,res_state,state_fips_code,res_county,county_fips_code,age_group,sex,race,ethnicity,case_positive_specimen_interval,case_onset_interval,process,exposure_yn,labconfirmed_yn,symptomatic_yn,hosp_yn,icu_yn,death_yn,underlying_conditions_yn
0,2020-01,NY,36.0,ONEIDA,36065.0,,,,,0.0,,Missing,,1,,,,0.0,


So if we look for null data, we find that most rows are missing at least some data. The only columns that are complete in every row are 'case_month', 'process', and 'labconfirmed_yn'. Even our response variable, 'death_yn', is missing data.

In [43]:
train.isnull().describe()

Unnamed: 0,case_month,res_state,state_fips_code,res_county,county_fips_code,age_group,sex,race,ethnicity,case_positive_specimen_interval,case_onset_interval,process,exposure_yn,labconfirmed_yn,symptomatic_yn,hosp_yn,icu_yn,death_yn,underlying_conditions_yn
count,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855
unique,1,2,2,2,2,2,2,2,2,2,2,1,2,1,2,2,2,2,2
top,False,False,False,False,False,False,False,False,False,True,True,False,True,False,True,True,True,True,True
freq,36225855,36224934,36224934,33728416,33728416,35828300,35136242,30504585,29486359,23875686,18413918,36225855,33650024,36225855,18782425,19133076,34299032,22830993,34127479


We'll also want to deal with other issues the data may have. Specifically, there are several rows that have 'confusing' or confused data. We should take the time to standardize our data, and also set any very uncommon values to NA where appropriate. Effectively, we need to define a data dictionary to support us through the rest of this investigation.

In [None]:
# for i in train.columns:
#   print('Unique items '+i)
#   print(train[i].unique())

So I'm going to ignore the correctness of county and state. I still want to use these data, but I can't know whether the spellings are all 100% correct, or at least I don't think it's worth the time to check. Age group seems pretty well formed. The 'sex' column may need some work, and we can see that race has several classes that are alternative ways data could be missing. We'll have to make a choice as to how to handle 'true' missing data.

Other columns with issues appear to be exposure_yn, labconfirmed_yn, symptomatic_yn; basically all of the columns. One issue is that they aren't standardized into floats, strings, or booleans. There are also some mixed types and some text 'nulls' peppered in there. We'll need to address all of these before proceeding.

Another consideration is that this NA processing and filtering has to happen on our test and validation sets as well as the training set. We'll need to keep this in mind in how we implement our solution.
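
One way to keep the cleaning identical across splits is to put it in a single function and call it on each frame. This is a sketch only: the helper name `clean_yn_columns` and the toy frames are made up, and it assumes every text value that does not parse as a number (such as 'nul') should become NaN. It deliberately leaves out the separate decision of treating missing `exposure_yn` as 0.

```python
import numpy as np
import pandas as pd

YN_COLUMNS = ("exposure_yn", "labconfirmed_yn", "symptomatic_yn", "hosp_yn", "icu_yn")

def clean_yn_columns(df, columns=YN_COLUMNS):
    """Return a copy of df with the yes/no columns coerced to float (1.0, 0.0, NaN)."""
    out = df.copy()
    for col in columns:
        if col in out.columns:
            # '1' -> 1.0, '0' -> 0.0; anything unparseable (e.g. 'nul') -> NaN.
            out[col] = pd.to_numeric(out[col], errors="coerce").astype(float)
    return out

toy_train = pd.DataFrame({"icu_yn": ["1", "0", "nul", 1.0]})
toy_test = pd.DataFrame({"icu_yn": [0, "1", None]})

clean_train = clean_yn_columns(toy_train)
clean_test = clean_yn_columns(toy_test)
print(clean_train["icu_yn"].tolist(), clean_test["icu_yn"].tolist())
```

Because the function returns a copy, the original frames stay untouched and the same call can be reused on any later split.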


In [67]:
import numpy as np

# This is a weird one because we know they were all exposed, so not-knowing is
# not really a NaN.
# Assign through .loc with a boolean mask instead of chained indexing, so the
# write goes to `train` itself rather than to a possible temporary copy.
train.loc[train['exposure_yn'].isnull(), 'exposure_yn'] = 0.0
train.loc[train['exposure_yn'] == 1, 'exposure_yn'] = 1.0
# train['exposure_yn'] = train['exposure_yn'].astype('float')

train.loc[train['labconfirmed_yn'] == 1, 'labconfirmed_yn'] = 1.0
train.loc[train['labconfirmed_yn'] == 0, 'labconfirmed_yn'] = 0.0
# train['labconfirmed_yn'] = train['labconfirmed_yn'].astype('float')

# These columns mix numbers with the strings '1' and '0'.
train.loc[train['symptomatic_yn'] == '1', 'symptomatic_yn'] = 1.0
train.loc[train['symptomatic_yn'] == '0', 'symptomatic_yn'] = 0.0
# train['symptomatic_yn'] = train['symptomatic_yn'].astype('float')

train.loc[train['hosp_yn'] == 1, 'hosp_yn'] = 1.0
train.loc[train['hosp_yn'] == 0, 'hosp_yn'] = 0.0
# train['hosp_yn'] = train['hosp_yn'].astype('float')

train.loc[train['icu_yn'] == '1', 'icu_yn'] = 1.0
train.loc[train['icu_yn'] == '0', 'icu_yn'] = 0.0
# 'nul' is a text placeholder for a missing value.
train.loc[train['icu_yn'] == 'nul', 'icu_yn'] = np.nan
# train['icu_yn'] = train['icu_yn'].astype('float')



In [68]:
for i in train.columns:
  print('Unique items '+i)
  print(train[i].unique())

Unique items case_month
['2020-01' '2020-02' '2020-03' '2020-04' '2020-05' '2020-06' '2020-07'
 '2020-08' '2020-09' '2020-10' '2020-11' '2020-12' '2021-01' '2021-02'
 '2021-03' '2021-04' '2021-05' '2021-06' '2021-07' '2021-08' '2021-09']
Unique items res_state
['NY' 'NC' 'NJ' 'IA' 'GA' 'NV' 'TX' 'FL' 'CA' 'TN' 'SC' 'UT' 'MO' 'WI'
 'OH' 'WA' 'MI' nan 'CO' 'CT' 'IN' 'MA' 'PR' 'MD' 'AL' 'ME' 'SD' 'AZ' 'KY'
 'NM' 'KS' 'NE' 'PA' 'VA' 'IL' 'DC' 'LA' 'AR' 'MS' 'OR' 'MN' 'VT' 'MT'
 'ID' 'AK' 'OK' 'HI' 'NH' 'ND' 'WY' 'RI' 'DE' 'WV' 'VI' 'GU']
Unique items state_fips_code
[36. 37. 34. 19. 13. 32. 48. 12.  6. 47. 45. 49. 29. 55. 39. 53. 26. nan
  8.  9. 18. 25. 72. 24.  1. 23. 46.  4. 21. 35. 20. 31. 42. 51. 17. 11.
 22.  5. 28. 41. 27. 50. 30. 16.  2. 40. 15. 33. 38. 56. 44. 10. 54. 78.
 66.]
Unique items res_county
['ONEIDA' nan 'MONMOUTH' ... 'VILAS' 'WOODWARD' 'CALEDONIA']
Unique items county_fips_code
[36065.    nan 34025. ... 50019. 50005. 50017.]
Unique items age_group
[nan '65+ years' '18

OK, at least now our data is somewhat standardized and we can look for clustering among our NAs. As a reminder of what that looks like:

In [69]:
train.isnull().describe()

Unnamed: 0,case_month,res_state,state_fips_code,res_county,county_fips_code,age_group,sex,race,ethnicity,case_positive_specimen_interval,case_onset_interval,process,exposure_yn,labconfirmed_yn,symptomatic_yn,hosp_yn,icu_yn,death_yn,underlying_conditions_yn
count,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855,36225855
unique,1,2,2,2,2,2,2,2,2,2,2,1,1,1,2,2,2,2,2
top,False,False,False,False,False,False,False,False,False,True,True,False,False,False,True,True,True,True,True
freq,36225855,36224934,36224934,33728416,33728416,35828300,35136242,30504585,29486359,23875686,18413918,36225855,36225855,36225855,18782425,19133076,34299459,22830993,34127479


We'll need to convert classes to booleans, so we're going to drop some redundant columns (the FIPS codes, which duplicate state and county).

In [70]:
train.columns

Index(['case_month', 'res_state', 'state_fips_code', 'res_county',
       'county_fips_code', 'age_group', 'sex', 'race', 'ethnicity',
       'case_positive_specimen_interval', 'case_onset_interval', 'process',
       'exposure_yn', 'labconfirmed_yn', 'symptomatic_yn', 'hosp_yn', 'icu_yn',
       'death_yn', 'underlying_conditions_yn'],
      dtype='object')

In [79]:
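na_cluster_test = train[['case_month', 'res_state', 'res_county',
       'age_group', 'sex', 'race', 'ethnicity',
       'case_positive_specimen_interval', 'case_onset_interval', 'process',
       'exposure_yn', 'labconfirmed_yn', 'symptomatic_yn', 'hosp_yn', 'icu_yn',
       'death_yn', 'underlying_conditions_yn']]

# Note: the mask is built from the full `train` frame, so the column selection
# above is overwritten and the FIPS code columns still appear in the results.
# After the cast, 1 marks a missing value and 0 a present one.
na_cluster_test = train.isnull()
na_cluster_test = na_cluster_test.astype(int)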

In [92]:
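from sklearn.cluster import KMeans

# fit_predict fits the model and returns each row's cluster label in one call,
# so a separate fit() beforehand is not needed.
kmeans = KMeans(n_clusters=5)
identified_clusters = kmeans.fit_predict(na_cluster_test)
identified_clusters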

array([3, 3, 3, ..., 0, 4, 1], dtype=int32)

What we're looking for here is any strong or weak grouping where a column's NAs fall almost entirely in one cluster or another; effectively, correlation between the NAs in different columns.

In [93]:
na_cluster_test.groupby(identified_clusters).sum() / na_cluster_test.sum()

Unnamed: 0,case_month,res_state,state_fips_code,res_county,county_fips_code,age_group,sex,race,ethnicity,case_positive_specimen_interval,case_onset_interval,process,exposure_yn,labconfirmed_yn,symptomatic_yn,hosp_yn,icu_yn,death_yn,underlying_conditions_yn
0,,0.0,0.0,0.365408,0.365408,0.0,0.0,0.0,0.049051,0.357101,0.533693,,,,0.565109,0.491611,0.310726,0.467776,0.311171
1,,0.0,0.0,0.228932,0.228932,0.0,0.0,0.0,0.053112,0.380993,0.016037,,,,0.028304,0.119302,0.252687,0.216703,0.255061
2,,0.0,0.0,0.109684,0.109684,0.0,0.0,0.0,0.018612,0.086819,0.266843,,,,0.250026,0.131166,0.141589,0.007143,0.142212
3,,1.0,1.0,0.110973,0.110973,1.0,1.0,0.999929,0.853919,0.175088,0.160272,,,,0.156561,0.140902,0.159607,0.171073,0.159393
4,,0.0,0.0,0.185004,0.185004,0.0,0.0,7.1e-05,0.025305,0.0,0.023154,,,,0.0,0.117019,0.135391,0.137304,0.132163


What's interesting here is that for most columns the NAs are spread fairly evenly across the clusters. However, one cluster (cluster 3) sticks out: it contains all the NAs from res_state, state_fips_code, age_group, and sex, nearly all of those from race, and most of those from ethnicity. It would likely make sense, then, to split our modeling. We can make one model that predicts from res_state, state_fips_code, age_group, sex, race, and ethnicity, since we can expect those columns to be complete together. We can then place the rest of the columns in another model. In this way we can minimize the total amount of imputation we have to rely on, should we choose to rely on it. We can also consider not doing imputation at all and just running the two models where we have data.
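
To make the reading of that table concrete, here is the same share calculation on a toy 0/1 missingness mask. The mask values and the cluster labels are made up for illustration; no clustering is actually run here.

```python
import numpy as np
import pandas as pd

# 1 marks a missing value, 0 a present one.
na_mask = pd.DataFrame({
    "age_group": [1, 1, 0, 0, 0, 0],
    "sex":       [1, 1, 0, 0, 0, 0],
    "hosp_yn":   [0, 1, 1, 0, 1, 0],
})
labels = np.array([0, 0, 1, 1, 2, 2])  # hypothetical cluster label per row

# For each column: the share of its NAs that falls in each cluster.
share = na_mask.groupby(labels).sum() / na_mask.sum()
print(share)
```

A column whose share is close to 1 in a single cluster (here `age_group` and `sex` in cluster 0) has its missingness concentrated there, while a column like `hosp_yn` is spread across clusters. Each column's shares sum to 1.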

We'll need a method for imputing random values into the dataset.

In [95]:
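import random

def randomiseMissingData(df):
    """Randomise missing data for a DataFrame (within each column)."""
    # Work on a copy so a frame that is a slice of another is not modified in place.
    df = df.copy()
    for col in df.columns:
        mask = df[col].isnull()
        # Nothing to fill, or nothing to draw from: leave the column as it is.
        if not mask.any() or mask.all():
            continue
        samples = random.choices(df.loc[~mask, col].values, k=int(mask.sum()))
        df.loc[mask, col] = samples
    return df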

In [107]:
train.columns

Index(['case_month', 'res_state', 'state_fips_code', 'res_county',
       'county_fips_code', 'age_group', 'sex', 'race', 'ethnicity',
       'case_positive_specimen_interval', 'case_onset_interval', 'process',
       'exposure_yn', 'labconfirmed_yn', 'symptomatic_yn', 'hosp_yn', 'icu_yn',
       'death_yn', 'underlying_conditions_yn'],
      dtype='object')

In [115]:
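# Rows outside cluster 3 are the ones whose demographic columns are mostly present.
# The label 3 comes from the KMeans run above and can change between runs.
keep = identified_clusters != 3
small_cluster_train = train.loc[keep, ['case_month', 'res_state', 'age_group', 'sex', 'race', 'ethnicity']]
small_cluster_train = randomiseMissingData(small_cluster_train)
small_cluster_train = pd.get_dummies(small_cluster_train)
# Keep only rows with a known label, using a mask aligned to this frame's index.
small_cluster_train = small_cluster_train[train.loc[small_cluster_train.index, 'death_yn'].notnull()]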

  """


In [119]:
subset = train[identified_clusters != 3]
y = subset.loc[subset['death_yn'].notnull(), 'death_yn']

In [120]:
print(len(y))
print(len(small_cluster_train))

11545647
11545647


In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

tree = DecisionTreeClassifier()
bag = BaggingClassifier(tree, n_estimators=100, max_samples=0.8,
                        random_state=1)

bag.fit(small_cluster_train, y)

In [None]:
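# Use the same feature columns the model was trained on.
feature_cols = ['case_month', 'res_state', 'age_group', 'sex', 'race', 'ethnicity']
fill_test = randomiseMissingData(test[feature_cols])
# One-hot encode, then align to the training columns so the model sees the same features.
test_X = pd.get_dummies(fill_test).reindex(columns=small_cluster_train.columns, fill_value=0)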


In [None]:
bag.predict(test_X)

In [None]:
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4,
                  random_state=0, cluster_std=1.0)

In [None]:
y

In [None]:

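# fillna() does not accept a function; draw random observed values for the missing rows instead.
mask = df["column"].isnull()
df.loc[mask, "column"] = df.loc[~mask, "column"].sample(n=int(mask.sum()), replace=True).values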


In [None]:
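# Same idea on train_small: fill the missing rows with random draws from the observed values.
mask = train_small["column"].isnull()
train_small.loc[mask, "column"] = train_small.loc[~mask, "column"].sample(n=int(mask.sum()), replace=True).values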


In [None]:
train_na = train.dropna(axis=0, how='any')  # assumed definition, mirroring test_na below
print(train_na)

In [None]:
# dropna(inplace=True) returns None, so drop the flag to get the filtered frame back.
test_na = test.dropna(axis=0, how='any')

In [None]:
print(test_na)

So you can immediately notice some pretty big differences between the two datasets. First, there are more counties and states in training than in test. The test set is probably a true random sample (which could mean that some rare counties or combinations of groups don't exist in the validation set). The same goes for sex: there is a pretty big difference between the two. That one is a little weirder, though, because there are so many rows; the two ratios should be much closer than that.
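
A comparison like this can be made explicit by putting the category shares of both sets side by side. The sketch below uses a hypothetical helper (`compare_proportions`) and made-up toy frames; missing values are counted as their own category.

```python
import pandas as pd

def compare_proportions(train_df, test_df, column):
    """Side-by-side share of each category in train and test, plus the difference."""
    train_share = train_df[column].fillna("(missing)").value_counts(normalize=True)
    test_share = test_df[column].fillna("(missing)").value_counts(normalize=True)
    table = pd.concat([train_share, test_share], axis=1, keys=["train", "test"]).fillna(0.0)
    table["diff"] = table["train"] - table["test"]
    return table

toy_train = pd.DataFrame({"sex": ["Male", "Male", "Female", None]})
toy_test = pd.DataFrame({"sex": ["Female", "Female", "Male", "Female"]})
table = compare_proportions(toy_train, toy_test, "sex")
print(table)
```

Categories that exist in only one of the two sets show up with a share of 0 in the other, which makes absent states or counties easy to spot.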

In [None]:
# Example code for training model and creating submission file.
# Author: Peter Sadowski Jan 22 2022
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Load training data.
df_train = pd.read_csv('./train_small.csv.zip') # Can read from zip files directly.
df_train = df_train.replace({'death_yn':{np.nan:0}}) # Assume no info means survived.
y = df_train['death_yn']

# Load test data.
df_test = pd.read_csv('./test.csv.zip')

# Encode state variable as one-hot. 
enc = OneHotEncoder(handle_unknown='ignore', sparse=False)
enc.fit(df_train[['res_state']])
state_train = enc.transform(df_train[['res_state']])
state_test = enc.transform(df_test[['res_state']])

# Combine this with whether patient went to ICU.
X = np.concatenate([df_train[['icu_yn']]==1, state_train], axis=1)
X_test = np.concatenate([df_test[['icu_yn']] == 1, state_test], axis=1)

# Make predictions based on whether patient went to ICU, and their state.
model = LogisticRegression()
model.fit(X,y)
ypred = model.predict_proba(X_test)[:,1]
print(f'Model coefficients: {model.coef_}')

# Create submission file.
submission = pd.DataFrame(ypred, columns=['prediction']) # Create new dataframe.
submission['Id'] = submission.index  # Kaggle expects two columns: Id, prediction.
submission.to_csv('sample_submission.csv', index=False)

import matplotlib.pylab as plt
plt.hist(ypred, bins=100);

In [None]:
submission
!kaggle competitions submit covid-19-risk-2022 -f sample_submission.csv -m 'Heres Johnny'


## Uploading a Colab notebook to Kaggle Kernels

Bear with us, as this is a little round-about...

### Downloading a notebook from Colab

To download from Colab, use **File** | **Download .ipynb**

In [None]:
# user = "ics435"
# repo = "ps1-numpy-seismosmsr"
# src_dir = "master"
# pyfile = "kaggle_api_in_colab.ipynb"
# raw_git = 'https://raw.githubusercontent.com/seismosmsr/machine_learning/main/kaggle_api_in_colab.ipynb'

# url = f"{raw_git}"

# !wget --no-cache --backups=1 {url} -O submission.ipynb

### Then upload the notebook to your Colab runtime

In [None]:
# # uploaded = files.upload()
# notebook_path = '/content/kaggle_api_in_colab.ipynb'

In [None]:
# uploaded = files.upload()
# notebook_path = list(uploaded.keys())[0]

In [None]:
# !mkdir -p export
# !mv $notebook_path export/
# !kaggle kernels init -p export

In [None]:
# import re
# import random
# your_kaggle_username = 'Aron Boettcher'
# notebook_title = 'Test Kernel ' + str(random.randint(1,100))
# new_kernel_slug = re.sub(r'[^a-z0-9]+', '-', notebook_title.lower())
# notebook_path = 'kaggle_api_in_colab.ipynb'

In [None]:
# # Documented here: https://github.com/Kaggle/kaggle-api/wiki/Kernel-Metadata
# metadata = '''
# {
#   "id": "%s/%s",
#   "title": "%s",
#   "code_file": "%s",
#   "language": "python",
#   "kernel_type": "notebook",
#   "is_private": "true",
#   "enable_gpu": "false",
#   "enable_internet": "true",
#   "dataset_sources": [],
#   "competition_sources": [],
#   "kernel_sources": []
# }
# ''' % (your_kaggle_username, new_kernel_slug, notebook_title, notebook_path)

In [None]:
# !echo '$metadata' > export/kernel-metadata.json
# !cat export/kernel-metadata.json

In [None]:
# !kaggle kernels push -p export

In [None]:
# the functions:
def stratified_sample(df, strata, size=None, seed=None, keep_index= True):
    '''
    It samples data from a pandas dataframe using strata. These functions use
    proportionate stratification:
    n1 = (N1/N) * n
    where:
        - n1 is the sample size of stratum 1
        - N1 is the population size of stratum 1
        - N is the total population size
        - n is the sampling size
    Parameters
    ----------
    :df: pandas dataframe from which data will be sampled.
    :strata: list containing columns that will be used in the stratified sampling.
    :size: sampling size. If not informed, a sampling size will be calculated
        using Cochran adjusted sampling formula:
        cochran_n = (Z**2 * p * q) /e**2
        where:
            - Z is the z-value. In this case we use 1.96 representing 95%
            - p is the estimated proportion of the population which has an
                attribute. In this case we use 0.5
            - q is 1-p
            - e is the margin of error
        This formula is adjusted as follows:
        adjusted_cochran = cochran_n / (1 + ((cochran_n - 1) / N))
        where:
            - cochran_n = result of the previous formula
            - N is the population size
    :seed: sampling seed
    :keep_index: if True, it keeps a column with the original population index indicator
    
    Returns
    -------
    A sampled pandas dataframe based in a set of strata.
    Examples
    --------
    >> df.head()
    	id  sex age city 
    0	123 M   20  XYZ
    1	456 M   25  XYZ
    2	789 M   21  YZX
    3	987 F   40  ZXY
    4	654 M   45  ZXY
    ...
    # This returns a sample stratified by sex and city containing 30% of the size of
    # the original data
    >> stratified = stratified_sample(df=df, strata=['sex', 'city'], size=0.3)
    Requirements
    ------------
    - pandas
    - numpy
    '''
    population = len(df)
    size = __smpl_size(population, size)
    tmp = df[strata].copy()  # copy so adding the helper column does not touch df
    tmp['size'] = 1
    tmp_grpd = tmp.groupby(strata).count().reset_index()
    tmp_grpd['samp_size'] = round(size/population * tmp_grpd['size']).astype(int)

    # controlling variable to create the dataframe or append to it
    first = True 
    for i in range(len(tmp_grpd)):
        # query generator for each iteration
        qry=''
        for s in range(len(strata)):
            stratum = strata[s]
            value = tmp_grpd.iloc[i][stratum]
            n = tmp_grpd.iloc[i]['samp_size']

            if type(value) == str:
                value = "'" + str(value) + "'"
            
            if s != len(strata)-1:
                qry = qry + stratum + ' == ' + str(value) +' & '
            else:
                qry = qry + stratum + ' == ' + str(value)
        
        # final dataframe
        if first:
            stratified_df = df.query(qry).sample(n=n, random_state=seed).reset_index(drop=(not keep_index))
            first = False
        else:
            tmp_df = df.query(qry).sample(n=n, random_state=seed).reset_index(drop=(not keep_index))
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
            stratified_df = pd.concat([stratified_df, tmp_df], ignore_index=True)
    
    return stratified_df



def stratified_sample_report(df, strata, size=None):
    '''
    Generates a dataframe reporting the counts in each stratum and the counts
    for the final sampled dataframe.
    Parameters
    ----------
    :df: pandas dataframe from which data will be sampled.
    :strata: list containing columns that will be used in the stratified sampling.
    :size: sampling size. If not informed, a sampling size will be calculated
        using Cochran adjusted sampling formula:
        cochran_n = (Z**2 * p * q) /e**2
        where:
            - Z is the z-value. In this case we use 1.96 representing 95%
            - p is the estimated proportion of the population which has an
                attribute. In this case we use 0.5
            - q is 1-p
            - e is the margin of error
        This formula is adjusted as follows:
        adjusted_cochran = cochran_n / (1 + ((cochran_n - 1) / N))
        where:
            - cochran_n = result of the previous formula
            - N is the population size
    Returns
    -------
    A dataframe reporting the counts in each stratum and the counts
    for the final sampled dataframe.
    '''
    population = len(df)
    size = __smpl_size(population, size)
    tmp = df[strata].copy()  # copy so adding the helper column does not touch df
    tmp['size'] = 1
    tmp_grpd = tmp.groupby(strata).count().reset_index()
    tmp_grpd['samp_size'] = round(size/population * tmp_grpd['size']).astype(int)
    return tmp_grpd


def __smpl_size(population, size):
    '''
    A function to compute the sample size. If not informed, a sampling 
    size will be calculated using Cochran adjusted sampling formula:
        cochran_n = (Z**2 * p * q) /e**2
        where:
            - Z is the z-value. In this case we use 1.96 representing 95%
            - p is the estimated proportion of the population which has an
                attribute. In this case we use 0.5
            - q is 1-p
            - e is the margin of error
        This formula is adjusted as follows:
        adjusted_cochran = cochran_n / (1 + ((cochran_n - 1) / N))
        where:
            - cochran_n = result of the previous formula
            - N is the population size
    Parameters
    ----------
        :population: population size
        :size: sample size (default = None)
    Returns
    -------
    Calculated sample size to be used in the functions:
        - stratified_sample
        - stratified_sample_report
    '''
    if size is None:
        cochran_n = round(((1.96)**2 * 0.5 * 0.5)/ 0.02**2)
        n = round(cochran_n/(1+((cochran_n -1) /population)))
    elif size >= 0 and size < 1:
        n = round(population * size)
    elif size < 0:
        raise ValueError('Parameter "size" must be an integer or a proportion between 0 and 0.99.')
    elif size >= 1:
        n = size
    return n
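
As a worked example of the default branch above, the Cochran calculation can be pulled out on its own. This is a sketch: the function name `cochran_sample_size` is made up, and the defaults mirror the constants in `__smpl_size` (z of 1.96, p of 0.5, and a 0.02 margin of error).

```python
def cochran_sample_size(population, z=1.96, p=0.5, e=0.02):
    """Cochran sample size with the finite-population adjustment."""
    # Unadjusted size: (Z**2 * p * q) / e**2, with q = 1 - p.
    cochran_n = round((z ** 2 * p * (1 - p)) / e ** 2)
    # Adjustment for a finite population of size N.
    return round(cochran_n / (1 + (cochran_n - 1) / population))

# The training set above has 36,225,855 rows.
print(cochran_sample_size(36_225_855))
print(cochran_sample_size(10_000))
```

For a population as large as the training set the adjustment barely changes the unadjusted size, while for a small population it shrinks the sample noticeably.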
