# Task: sampling methods

**Description**
Learn how to sample data with Python.

### Level 1
#### Exercise 1
Pick a sports-themed dataset that you like. Sample the data by generating a simple random sample and a systematic sample.

In [1]:
import pandas as pd
import numpy as np
import random

from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

### Chosen data

We use the [mlbBat10.txt](./data-sources/mlbBat10.txt) data. The field descriptions are available [here](https://www.openintro.org/data/index.php?data=mlbbat10).

Format
A data frame with 1199 observations on the following 19 variables.

|field|description|
|-|-|
|**name**|Player name|
|**team**|Team abbreviation|
|**position**|Player position|
|**G** game|Number of games|
|**AB** at_bat|Number of at bats|
|**R** run|Number of runs|
|**H** hit|Number of hits|
|**2B** double|Number of doubles|
|**3B** triple|Number of triples|
|**HR** home_run|Number of home runs|
|**RBI** rbi|Number of runs batted in|
|**TB** total_base|Total bases, computed as `3*HR + 2*3B + 1*2B + H`|
|**BB** walk|Number of walks|
|**SO** strike_out|Number of strikeouts|
|**SB** stolen_base|Number of stolen bases|
|**CS** caught_stealing|Number of times caught stealing|
|**OBP** obp|On base percentage|
|**SLG** slg|Slugging percentage (total_base / at_bat)|
|**AVG** bat_avg|Batting average|

In [2]:
# the file is tab-separated; read position and team as categorical columns
dtypes = {'position':'category', 'team':'category'}
mlb = pd.read_csv('../data-sources/mlbBat10.txt', sep='\t', dtype=dtypes)


##### Simple random sample

In [3]:
# draw 10% of the rows uniformly at random, without replacement
mostra_simple = mlb.sample(frac=0.1)
mostra_simple

Unnamed: 0,name,team,position,G,AB,R,H,2B,3B,HR,RBI,TB,BB,SO,SB,CS,OBP,SLG,AVG
606,J Payton,COL,OF,20,35,3,12,4,1,0,1,18,1,4,1,0,0.361,0.514,0.343
551,B Zito,SF,P,32,51,1,6,0,0,0,2,6,1,17,0,0,0.135,0.118,0.118
1119,J Papelbon,BOS,P,3,0,0,0,0,0,0,0,0,0,0,0,0,0.000,0.000,0.000
800,B Clevlen,ATL,OF,4,4,2,1,1,0,0,0,2,0,1,0,0,0.250,0.500,0.250
998,J Broxton,LAD,P,61,0,0,0,0,0,0,0,0,0,0,0,0,0.000,0.000,0.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
539,C Volstad,FLA,P,28,54,1,5,1,0,0,2,6,0,28,0,0,0.093,0.111,0.093
1074,J Lewis,CLE,P,3,0,0,0,0,0,0,0,0,0,0,0,0,0.000,0.000,0.000
728,R Budde,LAA,C,6,10,2,4,1,0,1,3,8,1,5,0,0,0.455,0.800,0.400
952,J Rincon,COL,P,2,1,0,0,0,0,0,0,0,0,0,0,0,0.000,0.000,0.000
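
`DataFrame.sample` draws without replacement by default, so each run returns a different subset unless a seed is passed through `random_state`. The sketch below illustrates the same idea with only the standard library. The `population` list and the seed value are made up for illustration and are not part of the notebook.

```python
import random

# Hypothetical population: 1199 row positions, matching the dataset size quoted above
population = list(range(1199))
frac = 0.1
n = round(len(population) * frac)  # number of rows in a 10% sample

# A seeded generator makes the draw reproducible
rng = random.Random(42)
simple_sample = rng.sample(population, n)  # sampling without replacement

# Re-seeding with the same value reproduces exactly the same sample
same_again = random.Random(42).sample(population, n)
print(n, simple_sample == same_again)
```

Fixing the seed is mostly useful when the sample has to be compared across runs or shared with someone else.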


##### Systematic sample

In [4]:
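frac = 0.1
# sampling interval: keep one row out of every 1/frac (10 for a 10% sample)
ind = round(1/frac)
# random starting position within the first interval
start = np.random.randint(0,ind)
mostra_sistematica = mlb.iloc[start::ind]
mostra_sistematica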

Unnamed: 0,name,team,position,G,AB,R,H,2B,3B,HR,RBI,TB,BB,SO,SB,CS,OBP,SLG,AVG
5,M Scutaro,BOS,SS,150,632,92,174,38,0,11,56,245,53,71,5,4,0.333,0.388,0.275
15,C McGehee,MIL,3B,157,610,70,174,38,1,23,104,283,50,102,1,1,0.337,0.464,0.285
25,V Guerrero,TEX,DH,152,593,83,178,27,1,29,115,294,35,60,4,5,0.345,0.496,0.300
35,A Pujols,STL,1B,159,587,115,183,39,1,42,118,350,103,76,14,4,0.414,0.596,0.312
45,A Pagan,NYM,OF,151,579,80,168,31,7,11,69,246,44,97,37,9,0.340,0.425,0.290
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1155,F Salas,STL,P,26,0,0,0,0,0,0,0,0,0,0,0,0,0.000,0.000,0.000
1165,C Smith,MIL,P,2,0,0,0,0,0,0,0,0,0,0,0,0,0.000,0.000,0.000
1175,T Stoner,NYM,P,1,0,0,0,0,0,0,0,0,0,0,0,0,0.000,0.000,0.000
1185,R VandenHurk,FLA,P,1,0,0,0,0,0,0,0,0,0,0,0,0,0.000,0.000,0.000
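
A systematic sample boils down to two numbers: a sampling interval of `1/frac` and a random starting offset inside the first interval. Here is a minimal standard-library sketch of that logic. The function name `systematic_sample` and the toy population are illustrative assumptions, not part of the notebook.

```python
import random

def systematic_sample(items, frac, rng=random):
    """Take every step-th item, starting at a random offset in [0, step)."""
    step = round(1 / frac)          # sampling interval, e.g. 10 for frac=0.1
    start = rng.randrange(step)     # random start within the first interval
    return items[start::step]

population = list(range(1199))      # stand-in for the 1199 row positions
sample = systematic_sample(population, 0.1, random.Random(0))

# Consecutive picks are always exactly `step` positions apart
gaps = {b - a for a, b in zip(sample, sample[1:])}
print(len(sample), gaps)
```

Because the rows are taken at a fixed interval, any ordering in the file (here the data appear to be sorted by playing time) carries over into the sample, which is worth keeping in mind when comparing it with the simple random sample.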


### Level 2
#### Exercise 2
Continue with the sports-themed dataset and generate a stratified sample and a sample using SMOTE (Synthetic Minority Oversampling Technique).

##### Stratified sample

- With pandas.
Example taken from [https://stackoverflow.com/questions/44114463/stratified-sampling-in-pandas](https://stackoverflow.com/questions/44114463/stratified-sampling-in-pandas)

In [5]:
frac = 0.1
# sample 10% of the rows within each position, so every stratum keeps its share
strat_pd = mlb.groupby('position', group_keys=False).apply(lambda x: x.sample(frac=frac))
strat_pd

Unnamed: 0,name,team,position,G,AB,R,H,2B,3B,HR,RBI,TB,BB,SO,SB,CS,OBP,SLG,AVG
474,K Matsui,COL,-,27,71,4,10,1,0,0,1,11,4,10,1,1,0.197,0.155,0.141
77,D Lee,ATL,1B,148,547,80,142,35,0,19,80,234,73,134,1,3,0.347,0.428,0.260
76,P Konerko,CWS,1B,149,548,89,171,30,1,39,111,320,72,110,0,1,0.393,0.584,0.312
591,M Lamb,FLA,1B,39,38,2,7,1,1,0,4,10,2,6,0,0,0.225,0.263,0.184
659,M Jacobs,NYM,1B,7,24,1,5,1,0,1,2,9,3,7,0,0,0.296,0.375,0.208
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
716,N Green,TOR,SS,9,13,2,2,0,0,0,1,2,1,3,0,0,0.214,0.154,0.154
129,J Bartlett,TB,SS,135,468,71,119,27,3,4,47,164,45,83,11,6,0.324,0.350,0.254
575,Y Navarro,BOS,SS,20,42,4,6,0,0,0,5,6,2,17,0,0,0.174,0.143,0.143
843,A Sanchez,BOS,SS,1,3,0,0,0,0,0,0,0,0,0,0,0,0.000,0.000,0.000


In [6]:
# same approach, stratifying by team instead of position
strat_pd_team = mlb.groupby('team', group_keys=False).apply(lambda x: x.sample(frac=frac))
strat_pd_team

Unnamed: 0,name,team,position,G,AB,R,H,2B,3B,HR,RBI,TB,BB,SO,SB,CS,OBP,SLG,AVG
244,M Montero,ARI,C,85,297,36,79,20,2,9,43,130,29,71,0,1,0.332,0.438,0.266
39,K Johnson,ARI,2B,154,585,93,166,36,5,26,71,290,79,148,13,7,0.370,0.496,0.284
110,M Reynolds,ARI,3B,145,499,79,99,17,2,32,85,216,83,211,7,4,0.320,0.433,0.198
307,R Church,ARI,OF,106,219,25,44,16,1,5,25,77,16,65,1,0,0.265,0.352,0.201
267,A Gonzalez,ATL,SS,72,267,27,64,17,2,6,38,103,14,53,0,2,0.291,0.386,0.240
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53,J Bautista,TOR,OF,161,569,109,148,35,3,54,124,351,100,116,9,2,0.378,0.617,0.260
227,C Guzman,WSH,2B,89,319,44,90,11,4,2,25,115,17,53,4,2,0.327,0.361,0.282
758,Y Maya,WSH,P,5,7,0,1,0,0,0,0,1,0,4,0,0,0.143,0.143,0.143
504,L Hernandez,WSH,P,33,61,2,9,1,0,1,3,13,1,13,0,0,0.161,0.213,0.148
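
The `groupby(...).apply(...)` pattern draws the same fraction from each stratum separately, so every group keeps roughly its share of the sample. Below is a library-free sketch of that logic, using a made-up list of `(name, position)` records and a hypothetical helper `stratified_sample`.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, key, frac, rng=random):
    """Sample the same fraction from every stratum defined by key(record)."""
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for group in strata.values():
        n = round(len(group) * frac)       # per-stratum sample size
        sample.extend(rng.sample(group, n))
    return sample

# Toy data (not from mlbBat10): 60 pitchers, 30 outfielders, 10 catchers
records = ([(f"p{i}", "P") for i in range(60)]
           + [(f"of{i}", "OF") for i in range(30)]
           + [(f"c{i}", "C") for i in range(10)])

sample = stratified_sample(records, key=lambda r: r[1], frac=0.1,
                           rng=random.Random(1))
counts = Counter(pos for _, pos in sample)
print(counts)  # Counter({'P': 6, 'OF': 3, 'C': 1})
```

In this sketch, rounding decides whether a tiny stratum contributes any row at all: a group of 8 at 10% yields `round(0.8) = 1` row, while a group of 4 would yield none.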


##### SMOTE (Synthetic Minority Oversampling Technique)

Since this is baseball, we "know" that teams do not carry every position in equal numbers, so we will use SMOTE to generate a balanced dataset.

In [7]:
mlb.position.value_counts()

P     544
OF    226
C     113
2B     72
3B     71
SS     71
1B     69
DH     25
-       8
Name: position, dtype: int64

In [8]:
from imblearn.over_sampling import SMOTE

oversample = SMOTE()
# SMOTE needs numeric features, so encode the team abbreviation as an integer code
le = LabelEncoder()
le.fit(mlb.team)
mlb['team_enc'] = le.transform(mlb.team)
target = 'position'
# use every column as a feature except the name, the target and the raw team label
cols = [col for col in mlb.columns if col not in ['name','position','team']]

# class counts before SMOTE
print(f'Positions before SMOTE: \n{Counter(mlb[target])}')

# oversample every minority class up to the size of the largest one
X, y = oversample.fit_resample(mlb[cols],mlb[target])

# class counts after SMOTE
print(f'Positions after SMOTE: \n{Counter(y)}')



Positions before SMOTE: 
Counter({'P': 544, 'OF': 226, 'C': 113, '2B': 72, 'SS': 71, '3B': 71, '1B': 69, 'DH': 25, '-': 8})
Positions after SMOTE: 
Counter({'OF': 544, 'SS': 544, '3B': 544, '2B': 544, '1B': 544, 'DH': 544, 'C': 544, 'P': 544, '-': 544})
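
SMOTE creates each synthetic row by picking a minority-class sample, choosing one of its nearest neighbours from the same class, and interpolating a random point on the segment between them. The following is a simplified, library-free sketch of that interpolation step only. It is not the `imblearn` implementation, and the helper `smote_like` and the 2-D toy points are assumptions made for illustration.

```python
import math
import random

def smote_like(minority, n_new, k=3, rng=random):
    """Generate n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (excluding x itself)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nn = rng.choice(neighbours)
        u = rng.random()  # position along the segment from x to nn
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nn)))
    return synthetic

# Toy minority class with two numeric features (made-up values)
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2), (1.8, 2.1)]
new_points = smote_like(minority, n_new=4, rng=random.Random(0))
for p in new_points:
    print(p)
```

Because every feature is interpolated as a real number, an integer code such as `team_enc` can come out fractional in the synthetic rows. For mixed numeric and categorical features, `imblearn` also provides `SMOTENC`.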


### Level 3
#### Exercise 3
Continue with the sports-themed dataset and generate a sample using the reservoir sampling method.

##### Reservoir sampling

In [9]:
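k = 100
rows = []  # reservoir of rows (DataFrame.append was removed in pandas 2.0)

# enumerate gives a positional counter that does not depend on the index labels
for i, (_, element) in enumerate(mlb.iterrows()):
    if i < k:
        # fill the reservoir with the first k rows
        rows.append(element)
    else:
        # keep the (i+1)-th row with probability k/(i+1)
        prob = k/(i+1)
        if random.random() < prob:
            # overwrite a uniformly chosen slot of the reservoir
            rows[random.randrange(k)] = element

reservoir = pd.DataFrame(rows).reset_index(drop=True)
reservoir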

Unnamed: 0,name,team,position,G,AB,R,H,2B,3B,HR,RBI,TB,BB,SO,SB,CS,OBP,SLG,AVG,team_enc
0,A Torres,SF,OF,139,507,84,136,43,8,16,63,243,56,128,26,7,0.343,0.479,0.268,24
1,S West,FLA,P,2,3,0,0,0,0,0,0,0,0,2,0,0,0.000,0.000,0.000,10
2,M St. Pierre,DET,C,6,9,1,2,1,0,0,0,3,0,2,0,0,0.222,0.333,0.222,9
3,C Johnson,HOU,3B,94,341,40,105,22,2,11,52,164,15,91,3,0,0.337,0.481,0.308,11
4,R Johnson,LAD,OF,102,202,24,53,11,2,2,15,74,5,50,2,2,0.291,0.366,0.262,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,J Wilson,SEA,SS,108,361,22,82,14,2,2,25,106,14,74,5,0,0.278,0.294,0.227,23
96,T Stauffer,SD,P,32,17,2,3,2,0,0,2,5,2,12,0,0,0.263,0.294,0.176,22
97,J Carroll,LAD,SS,133,351,48,102,15,1,0,23,119,51,64,12,4,0.379,0.339,0.291,14
98,S Smith,COL,OF,133,358,55,88,19,5,17,52,168,35,67,2,1,0.314,0.469,0.246,7
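
Reservoir sampling is mainly useful when the data arrive as a stream whose length is not known in advance, since it needs a single pass and only `k` slots of memory. Here is a minimal sketch of the classic scheme (often called Algorithm R) over an arbitrary iterable. The function name `reservoir_sample` and the generator used as a stream are illustrative assumptions.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)        # uniform over the i+1 items seen so far
            if j < k:                    # happens with probability k/(i+1)
                reservoir[j] = item
    return reservoir

# A generator stands in for a stream that cannot be indexed or measured up front
stream = (n * n for n in range(1199))
sample = reservoir_sample(stream, k=100, rng=random.Random(7))
print(len(sample))  # 100
```

Drawing a single index `j` in `[0, i]` and replacing slot `j` when `j < k` is equivalent to the two-step version in the cell above, which accepts the row with probability `k/(i+1)` and then picks a random slot.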


In [10]:

# def gen_stream(data, n):
#     for i in range(0,int(data.shape[0]/n)):
#         stream=data[i*n:(i+1)*n]
#         yield stream
        

# reservoir = pd.DataFrame(columns=mlb.columns)
# for data in gen_stream(mlb, 10):
#     reservoir = reservoir.append(data.sample(1))
# reservoir