<a href="https://colab.research.google.com/github/rikdantas/Aprendizagem-de-Maquinas/blob/main/IMD1101/Atividade_Pre_Processamento/IMD1101_Pre_Procesamento.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preprocessing activity
This notebook carries out the practical preprocessing activity assigned in the Machine Learning course (IMD1101).

Student: Paulo Ricardo Dantas

## Importing the dataset
The first step is to import the dataset. The chosen dataset is Diabetes.csv, available at the following link: https://www.dropbox.com/scl/fo/g23norkdqy1z0y124zklf/ABUWZBocbv5tc3zLt1K14ak?rlkey=8swc156wdq9aqr11teh3dou0w&dl=0

After downloading the CSV file, upload it to the Colab file system so that the cell below can read it.

In [13]:
# Importing the libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import minmax_scale

# Importing the dataset
dataset = pd.read_csv('Diabetes.csv', encoding='utf-8')

## Inspecting the dataset attributes

### Attribute characteristics

In [14]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   preg    768 non-null    int64  
 1   plas    768 non-null    int64  
 2   pres    768 non-null    int64  
 3   skin    768 non-null    int64  
 4   insu    768 non-null    int64  
 5   mass    768 non-null    float64
 6   pedi    768 non-null    float64
 7   age     768 non-null    int64  
 8   classe  768 non-null    object 
dtypes: float64(2), int64(6), object(1)
memory usage: 54.1+ KB


### Mean of the attributes (excluding the classe attribute)


In [15]:
dataset.mean(numeric_only=True)

Unnamed: 0,0
preg,3.845052
plas,120.894531
pres,69.105469
skin,20.536458
insu,79.799479
mass,31.992578
pedi,0.471876
age,33.240885


### Median of the attributes (excluding the classe attribute)

In [16]:
dataset.median(numeric_only=True)

Unnamed: 0,0
preg,3.0
plas,117.0
pres,72.0
skin,23.0
insu,30.5
mass,32.0
pedi,0.3725
age,29.0
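Putting the mean and the median side by side makes skewed attributes easier to spot: in the outputs above, `insu` has a mean of about 79.80 but a median of 30.5, which suggests a long right tail or many very small values. The sketch below shows one way to build that comparison. It uses a small hypothetical `toy` DataFrame (not the real Diabetes.csv values), and the column name `mean_minus_median` is an illustrative choice.

```python
import pandas as pd

# Hypothetical toy data standing in for one skewed column
# (these are NOT the real Diabetes.csv values).
toy = pd.DataFrame({"insu": [0, 0, 0, 30, 50, 100, 400]})

# Mean and median in one table, plus their difference.
resumo = pd.DataFrame({
    "mean": toy.mean(numeric_only=True),
    "median": toy.median(numeric_only=True),
})
resumo["mean_minus_median"] = resumo["mean"] - resumo["median"]

print(resumo)
```

A clearly positive difference hints at a right-skewed attribute; applying the same three lines to `dataset` instead of `toy` would give the comparison for the real data.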


## Cleaning and transforming the attributes
This part covers the following steps:

1. Check whether there are missing values. If there are, apply one of the methods for handling them (shown in class);
2. Apply the necessary transformations to the numeric and discrete attributes;
3. Save a CSV version of the cleaned and transformed dataset.



### Checking for missing values

In [17]:
missing = dataset.isnull().sum()
print(missing)

preg      0
plas      0
pres      0
skin      0
insu      0
mass      0
pedi      0
age       0
classe    0
dtype: int64
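`isnull()` only detects values stored as NaN. In some distributions of diabetes data, a zero in a physiological measurement is used as a placeholder for "not measured"; whether that applies to this file is an assumption, not something the outputs above confirm. As a hedged sketch, the check below counts zeros in a few columns and shows one possible treatment (median imputation). The `toy` values and the `suspect_columns` list are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; the column names mirror the dataset, the values do not.
toy = pd.DataFrame({
    "plas": [148, 85, 0, 89],
    "insu": [0, 0, 94, 168],
    "mass": [33.6, 26.6, 0.0, 28.1],
})

# Assumption: a zero in these columns means "not measured".
suspect_columns = ["plas", "insu", "mass"]

# Count the zeros per column (isnull() would report none of them).
zero_counts = (toy[suspect_columns] == 0).sum()
print(zero_counts)

# One possible treatment: turn zeros into NaN, then fill with the column median.
cleaned = toy.copy()
cleaned[suspect_columns] = toy[suspect_columns].mask(toy[suspect_columns] == 0)
cleaned[suspect_columns] = cleaned[suspect_columns].fillna(
    cleaned[suspect_columns].median()
)
print(cleaned)
```

If zeros are legitimate values in a given column (for example `preg`), that column should simply be left out of `suspect_columns`.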


### Applying the necessary transformations to the attributes

In [18]:
# Separating the classe column
classe = dataset['classe']

# Dropping the classe column from the original dataset
dataset.drop('classe', axis=1, inplace=True)

# Applying min-max normalization to the numeric attributes
dataset_normalizado = dataset.apply(minmax_scale)

dataset_normalizado.head()

Unnamed: 0,preg,plas,pres,skin,insu,mass,pedi,age
0,0.352941,0.743719,0.590164,0.353535,0.0,0.500745,0.234415,0.483333
1,0.058824,0.427136,0.540984,0.292929,0.0,0.396423,0.116567,0.166667
2,0.470588,0.919598,0.52459,0.0,0.0,0.347243,0.253629,0.183333
3,0.058824,0.447236,0.540984,0.232323,0.111111,0.418778,0.038002,0.0
4,0.0,0.688442,0.327869,0.353535,0.198582,0.642325,0.943638,0.2
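With its default range of [0, 1], `minmax_scale` maps each value `x` of a column to `(x - min) / (max - min)`, so the smallest value becomes 0 and the largest becomes 1. The sketch below reproduces that formula by hand on a small hypothetical DataFrame; `min_max` is an illustrative helper, and the numbers are not taken from Diabetes.csv.

```python
import pandas as pd

# Hypothetical values chosen so the arithmetic is easy to follow.
toy = pd.DataFrame({
    "age": [21, 36, 51, 81],
    "mass": [20.0, 25.0, 30.0, 40.0],
})

def min_max(col: pd.Series) -> pd.Series:
    """Rescale a column to [0, 1] using (x - min) / (max - min)."""
    return (col - col.min()) / (col.max() - col.min())

scaled = toy.apply(min_max)
print(scaled)
```

For `age`, the value 36 becomes (36 - 21) / (81 - 21) = 0.25. Note that this hand-written version divides by zero for a constant column, so it is meant only to illustrate the formula.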


In [19]:
# Joining the normalized dataset with the classe column again
dataset_normalizado.insert(8, 'classe', classe)

dataset_normalizado.head()

Unnamed: 0,preg,plas,pres,skin,insu,mass,pedi,age,classe
0,0.352941,0.743719,0.590164,0.353535,0.0,0.500745,0.234415,0.483333,tested_positive
1,0.058824,0.427136,0.540984,0.292929,0.0,0.396423,0.116567,0.166667,tested_negative
2,0.470588,0.919598,0.52459,0.0,0.0,0.347243,0.253629,0.183333,tested_positive
3,0.058824,0.447236,0.540984,0.232323,0.111111,0.418778,0.038002,0.0,tested_negative
4,0.0,0.688442,0.327869,0.353535,0.198582,0.642325,0.943638,0.2,tested_positive


### Saving the cleaned and preprocessed version

In [20]:
# Saving the transformed Diabetes.csv
dataset_normalizado.to_csv('Diabetes_Limpo.csv', index=False)

## Sampling
This part produces the samples requested in the assignment:

1. Simple random sampling of 30%, without replacement;
2. Simple random sampling of 30%, with replacement;
3. Simple random sampling of 50%, without replacement;
4. Simple random sampling of 50%, with replacement;
5. Stratified sampling of 50% (same class proportions);
6. Simple random sampling of 70%, without replacement;
7. Simple random sampling of 70%, with replacement;
8. Stratified sampling of 70% (same class proportions).
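The difference between the two simple-sampling modes can be illustrated on a tiny frame: without replacement every row appears at most once, while with replacement the same row may be drawn more than once. The sketch below uses a hypothetical `toy` DataFrame and also passes `random_state`, which the cells in this notebook do not use; that is an optional addition that makes a sample reproducible across runs.

```python
import pandas as pd

# Hypothetical 10-row frame used only to illustrate the two modes.
toy = pd.DataFrame({"x": range(10)})

# Without replacement: 5 distinct rows.
sem_reposicao = toy.sample(frac=0.5, replace=False, random_state=42)

# With replacement: 5 draws, and the same row may appear more than once.
com_reposicao = toy.sample(frac=0.5, replace=True, random_state=42)

print(len(sem_reposicao), sem_reposicao.index.is_unique)
print(len(com_reposicao), com_reposicao.index.is_unique)

# Same seed, same sample: useful when results need to be reproduced.
repetida = toy.sample(frac=0.5, replace=False, random_state=42)
print(repetida.equals(sem_reposicao))
```

With the 768 rows of this dataset, fractions of 30%, 50%, and 70% correspond to roughly 230, 384, and 538 rows.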

In [23]:
# Importing the already cleaned CSV
df = pd.read_csv('Diabetes_Limpo.csv', encoding='utf-8')
df.head()

Unnamed: 0,preg,plas,pres,skin,insu,mass,pedi,age,classe
0,0.352941,0.743719,0.590164,0.353535,0.0,0.500745,0.234415,0.483333,tested_positive
1,0.058824,0.427136,0.540984,0.292929,0.0,0.396423,0.116567,0.166667,tested_negative
2,0.470588,0.919598,0.52459,0.0,0.0,0.347243,0.253629,0.183333,tested_positive
3,0.058824,0.447236,0.540984,0.232323,0.111111,0.418778,0.038002,0.0,tested_negative
4,0.0,0.688442,0.327869,0.353535,0.198582,0.642325,0.943638,0.2,tested_positive


### Simple sampling of 30%, without replacement

In [24]:
amostra01 = df.sample(frac=0.30, replace=False)
print(amostra01)

# Saving to CSV
amostra01.to_csv('amostra01.csv', index=False)

         preg      plas      pres      skin      insu      mass      pedi  \
129  0.000000  0.527638  0.688525  0.000000  0.000000  0.415797  0.283091   
612  0.411765  0.844221  0.721311  0.424242  0.379433  0.569300  0.302733   
6    0.176471  0.391960  0.409836  0.323232  0.104019  0.461997  0.072588   
347  0.176471  0.582915  0.000000  0.000000  0.000000  0.350224  0.046541   
556  0.058824  0.487437  0.573770  0.404040  0.000000  0.567809  0.059778   
..        ...       ...       ...       ...       ...       ...       ...   
543  0.235294  0.422111  0.737705  0.232323  0.066194  0.588674  0.034586   
319  0.352941  0.974874  0.639344  0.000000  0.000000  0.350224  0.021776   
358  0.705882  0.442211  0.606557  0.404040  0.063830  0.526080  0.128096   
375  0.705882  0.703518  0.672131  0.434343  0.384161  0.584203  0.192143   
353  0.058824  0.452261  0.508197  0.121212  0.050827  0.405365  0.214347   

          age           classe  
129  0.683333  tested_positive  
612  0.31

### Simple sampling of 30%, with replacement


In [25]:
amostra02 = df.sample(frac=0.30, replace=True)
print(amostra02)

# Saving to CSV
amostra02.to_csv('amostra02.csv', index=False)

         preg      plas      pres      skin      insu      mass      pedi  \
527  0.176471  0.582915  0.606557  0.151515  0.124113  0.391952  0.012383   
442  0.235294  0.587940  0.524590  0.272727  0.141844  0.494784  0.064902   
130  0.235294  0.869347  0.573770  0.141414  0.198582  0.442623  0.120837   
172  0.117647  0.437186  0.000000  0.232323  0.000000  0.430700  0.296755   
575  0.058824  0.597990  0.360656  0.474747  0.074468  0.529061  0.086251   
..        ...       ...       ...       ...       ...       ...       ...   
548  0.058824  0.824121  0.672131  0.434343  0.079196  0.488823  0.112297   
74   0.058824  0.396985  0.614754  0.303030  0.000000  0.476900  0.135781   
109  0.000000  0.477387  0.696721  0.252525  0.042553  0.557377  0.072161   
190  0.176471  0.557789  0.508197  0.000000  0.000000  0.336811  0.027327   
263  0.176471  0.713568  0.655738  0.151515  0.000000  0.482861  0.052092   

          age           classe  
527  0.050000  tested_negative  
442  0.05

### Simple sampling of 50%, without replacement

In [26]:
amostra03 = df.sample(frac=0.50, replace=False)
print(amostra03)

# Saving to CSV
amostra03.to_csv('amostra03.csv', index=False)

         preg      plas      pres      skin      insu      mass      pedi  \
390  0.058824  0.502513  0.540984  0.292929  0.231678  0.476900  0.156277   
105  0.058824  0.633166  0.459016  0.292929  0.179669  0.427720  0.308711   
289  0.294118  0.542714  0.590164  0.434343  0.088652  0.538003  0.078992   
515  0.176471  0.819095  0.573770  0.181818  0.124113  0.470939  0.081127   
620  0.117647  0.562814  0.704918  0.424242  0.189125  0.572280  0.071734   
..        ...       ...       ...       ...       ...       ...       ...   
26   0.411765  0.738693  0.622951  0.000000  0.000000  0.587183  0.076430   
169  0.176471  0.557789  0.737705  0.121212  0.092199  0.423249  0.178053   
753  0.000000  0.909548  0.721311  0.444444  0.602837  0.645306  0.061486   
346  0.058824  0.698492  0.377049  0.191919  0.098109  0.427720  0.245944   
47   0.117647  0.356784  0.573770  0.272727  0.000000  0.417288  0.216909   

          age           classe  
390  0.350000  tested_negative  
105  0.00

### Simple sampling of 50%, with replacement

In [27]:
amostra04 = df.sample(frac=0.50, replace=True)
print(amostra04)

# Saving to CSV
amostra04.to_csv('amostra04.csv', index=False)

         preg      plas      pres      skin      insu      mass      pedi  \
610  0.176471  0.532663  0.442623  0.212121  0.186761  0.460507  0.091375   
647  0.000000  0.899497  0.409836  0.363636  0.187943  0.563338  0.160974   
387  0.470588  0.527638  0.819672  0.363636  0.000000  0.645306  0.068745   
707  0.117647  0.638191  0.377049  0.212121  0.395981  0.512668  0.041845   
32   0.176471  0.442211  0.475410  0.111111  0.063830  0.369598  0.080700   
..        ...       ...       ...       ...       ...       ...       ...   
576  0.352941  0.542714  0.360656  0.202020  0.153664  0.357675  0.313834   
80   0.176471  0.567839  0.360656  0.131313  0.000000  0.333830  0.026473   
705  0.352941  0.402010  0.655738  0.363636  0.000000  0.593145  0.042272   
683  0.235294  0.628141  0.655738  0.000000  0.000000  0.481371  0.195559   
760  0.117647  0.442211  0.475410  0.262626  0.018913  0.423249  0.293766   

          age           classe  
610  0.050000  tested_negative  
647  0.01

### Stratified sampling of 50% (same proportions)

In [33]:
# Sample fraction (50%)
percentual_amostra = 0.5

# Getting the classes present in the dataset
classes = df['classe'].unique()

# For each class, we store a DataFrame with its sampled rows
amostras_por_classe = []

for c in classes:
    selecao_da_classe_atual = df.loc[df['classe'] == c]
    qtde_por_classe = round(len(selecao_da_classe_atual) * percentual_amostra)  # 50% of the instances
    amostra_c = selecao_da_classe_atual.sample(n=qtde_por_classe)
    amostras_por_classe.append(amostra_c)

amostra05 = pd.concat(amostras_por_classe)

print(amostra05)

# Saving to CSV
amostra05.to_csv('amostra05.csv', index=False)


         preg      plas      pres      skin      insu      mass      pedi  \
31   0.176471  0.793970  0.622951  0.363636  0.289598  0.470939  0.330060   
323  0.764706  0.763819  0.737705  0.333333  0.034279  0.399404  0.278822   
408  0.470588  0.989950  0.606557  0.000000  0.000000  0.385991  0.475235   
681  0.000000  0.814070  0.622951  0.363636  0.000000  0.739195  0.122118   
269  0.117647  0.733668  0.000000  0.000000  0.000000  0.409836  0.069172   
..        ...       ...       ...       ...       ...       ...       ...   
73   0.235294  0.648241  0.704918  0.202020  0.319149  0.523100  0.065329   
341  0.058824  0.477387  0.606557  0.212121  0.086288  0.385991  0.254056   
735  0.235294  0.477387  0.491803  0.323232  0.000000  0.527571  0.087959   
294  0.000000  0.809045  0.409836  0.000000  0.000000  0.326379  0.075149   
12   0.588235  0.698492  0.655738  0.000000  0.000000  0.403875  0.581981   

          age           classe  
31   0.116667  tested_positive  
323  0.36
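The per-class loop above can also be written more compactly with `groupby(...).sample(...)`, available in pandas 1.1 and later, which draws the requested fraction inside each group. The sketch below is an alternative formulation rather than the notebook's method; it runs on a hypothetical `toy` DataFrame whose class labels mirror the ones in this dataset, and `random_state=0` is an optional addition.

```python
import pandas as pd

# Hypothetical frame: 8 negative and 4 positive rows.
toy = pd.DataFrame({
    "valor": range(12),
    "classe": ["tested_negative"] * 8 + ["tested_positive"] * 4,
})

# Draw 50% of the rows inside each class.
estratificada = toy.groupby("classe").sample(frac=0.5, random_state=0)

print(estratificada["classe"].value_counts())
```

Because the fraction is applied per class, the 2:1 ratio between the classes in `toy` is preserved in the sample (4 negative rows and 2 positive rows).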

### Simple sampling of 70%, without replacement

In [34]:
amostra06 = df.sample(frac=0.70, replace=False)
print(amostra06)

# Saving to CSV
amostra06.to_csv('amostra06.csv', index=False)

         preg      plas      pres      skin      insu      mass      pedi  \
727  0.000000  0.708543  0.688525  0.262626  0.000000  0.482861  0.151580   
552  0.352941  0.572864  0.721311  0.000000  0.000000  0.414307  0.072161   
655  0.117647  0.778894  0.426230  0.272727  0.638298  0.576751  0.069172   
123  0.294118  0.663317  0.655738  0.000000  0.000000  0.399404  0.046114   
175  0.470588  0.899497  0.590164  0.424242  0.153664  0.487332  0.273698   
..        ...       ...       ...       ...       ...       ...       ...   
441  0.117647  0.417085  0.540984  0.232323  0.059102  0.479881  0.178907   
423  0.117647  0.577889  0.524590  0.222222  0.000000  0.459016  0.146456   
279  0.117647  0.542714  0.508197  0.101010  0.328605  0.377049  0.342869   
606  0.058824  0.909548  0.639344  0.424242  0.346336  0.596125  0.503843   
555  0.411765  0.623116  0.573770  0.333333  0.254137  0.380030  0.035440   

          age           classe  
727  0.016667  tested_negative  
552  0.75

### Simple sampling of 70%, with replacement

In [35]:
amostra07 = df.sample(frac=0.70, replace=True)
print(amostra07)

# Saving to CSV
amostra07.to_csv('amostra07.csv', index=False)

         preg      plas      pres      skin      insu      mass      pedi  \
198  0.235294  0.547739  0.524590  0.444444  0.117021  0.518629  0.353117   
746  0.058824  0.738693  0.770492  0.414141  0.000000  0.734724  0.119556   
25   0.588235  0.628141  0.573770  0.262626  0.135934  0.463487  0.054227   
567  0.352941  0.462312  0.508197  0.323232  0.148936  0.476900  0.002989   
5    0.294118  0.582915  0.606557  0.000000  0.000000  0.381520  0.052519   
..        ...       ...       ...       ...       ...       ...       ...   
419  0.176471  0.648241  0.524590  0.292929  0.135934  0.393443  0.060205   
499  0.352941  0.773869  0.606557  0.323232  0.228132  0.436662  0.324936   
637  0.117647  0.472362  0.622951  0.181818  0.078014  0.470939  0.243809   
717  0.588235  0.472362  0.590164  0.181818  0.000000  0.344262  0.220751   
247  0.000000  0.829146  0.737705  0.333333  0.803783  0.779434  0.149018   

          age           classe  
198  0.083333  tested_positive  
746  0.10

### Stratified sampling of 70% (same proportions)

In [36]:
# Sample fraction (70%)
percentual_amostra = 0.7

# Getting the classes present in the dataset
classes = df['classe'].unique()

# For each class, we store a DataFrame with its sampled rows
amostras_por_classe = []

for c in classes:
    selecao_da_classe_atual = df.loc[df['classe'] == c]
    qtde_por_classe = round(len(selecao_da_classe_atual) * percentual_amostra)  # 70% of the instances
    amostra_c = selecao_da_classe_atual.sample(n=qtde_por_classe)
    amostras_por_classe.append(amostra_c)

amostra08 = pd.concat(amostras_por_classe)

print(amostra08)

# Saving to CSV
amostra08.to_csv('amostra08.csv', index=False)

         preg      plas      pres      skin      insu      mass      pedi  \
276  0.411765  0.532663  0.491803  0.242424  0.000000  0.394933  0.093083   
130  0.235294  0.869347  0.573770  0.141414  0.198582  0.442623  0.120837   
676  0.529412  0.783920  0.704918  0.000000  0.000000  0.369598  0.064902   
663  0.529412  0.728643  0.655738  0.464646  0.153664  0.564829  0.238685   
292  0.117647  0.643216  0.639344  0.373737  0.215130  0.645306  0.489325   
..        ...       ...       ...       ...       ...       ...       ...   
459  0.529412  0.673367  0.606557  0.333333  0.070922  0.385991  0.163108   
59   0.000000  0.527638  0.524590  0.414141  0.167849  0.618480  0.040564   
351  0.235294  0.688442  0.688525  0.000000  0.000000  0.464978  0.074295   
89   0.058824  0.537688  0.557377  0.191919  0.000000  0.394933  0.037148   
724  0.058824  0.557789  0.770492  0.000000  0.000000  0.488823  0.079846   

          age           classe  
276  0.133333  tested_positive  
130  0.20
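Since the two stratified cells differ only in the fraction, the loop can be wrapped in a small function, and the "same proportions" requirement can be checked by comparing normalized class counts before and after sampling. The sketch below is one possible way to do that; `stratified_sample` is a hypothetical helper name, the `toy` DataFrame is illustrative, and the `random_state` parameter is an optional addition not used in the cells above.

```python
import pandas as pd

def stratified_sample(df, column, frac, random_state=None):
    """Sample `frac` of the rows of each group defined by `column`."""
    parts = []
    for _, group in df.groupby(column):
        n = round(len(group) * frac)  # same rounding as in the loop above
        parts.append(group.sample(n=n, random_state=random_state))
    return pd.concat(parts)

# Hypothetical frame: 8 negative and 4 positive rows.
toy = pd.DataFrame({
    "valor": range(12),
    "classe": ["tested_negative"] * 8 + ["tested_positive"] * 4,
})

amostra = stratified_sample(toy, "classe", frac=0.75, random_state=0)

# Compare class proportions in the full data and in the sample.
comparacao = pd.DataFrame({
    "original": toy["classe"].value_counts(normalize=True),
    "sample": amostra["classe"].value_counts(normalize=True),
})
print(comparacao)
```

On real data the two columns may differ slightly, because each class size is rounded to a whole number of rows.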