<a href="https://colab.research.google.com/github/wesleyroseno/colaboratory/blob/main/Curso_NVIDIA_Exerc%C3%ADcio_Desafio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NVIDIA Course - Challenge

> Dataset link: https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents

Challenge steps:

1. Load the dataset with cuDF. With millions of records, the speedup from using cuDF instead of Pandas is much more noticeable.
2. Remove null values
3. Remove unnecessary columns
 * Keep only the following columns: ['Severity', 'Source', 'County', 'State', 'Weather_Condition', 'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight', 'Amenity', 'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']
4. Label encoding
5. Balance the dataset
 * Tip: this step relies on interoperability between the libraries, so it is more convenient to convert to a Pandas dataframe and do the required processing there. This will also be useful for the next step.
6. Encoding - as an alternative, use [Ordinal Encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) instead of One Hot Encoding.
7. Scale the values
8. Split the training set
 * Tip: afterwards, you will need to convert from a Pandas dataframe to cuDF
9. Train with the KNeighborsClassifier algorithm
10. Run predictions and compute the accuracy


In [None]:
!pip install -q kaggle

* In the Kaggle dashboard, open the Settings page
https://www.kaggle.com/settings

* In the API section, click the [Create New Token] button

* A file named kaggle.json will be downloaded

* Upload that file to Colab.
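
The shell cells that follow create `~/.kaggle`, copy the token there, and restrict its permissions. As a rough Python equivalent, the sketch below does the same three things in one function. The name `install_kaggle_token` and its `home` parameter are hypothetical and exist only for this illustration.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, home=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner."""
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    # Like `mkdir -p`: does not fail if the directory already exists.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(token_path, target)
    # Same effect as `chmod 600`: read/write for the owner only.
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target
```

In Colab, after uploading the file, this could be called as `install_kaggle_token("kaggle.json")`. Unlike a plain `mkdir`, it can be re-run without raising an error.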


In [None]:
!mkdir ~/.kaggle

In [None]:
!cp kaggle.json ~/.kaggle/

In [None]:
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets list

ref                                                        title                                         size  lastUpdated          downloadCount  voteCount  usabilityRating  
---------------------------------------------------------  -------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
nelgiriyewithana/top-spotify-songs-2023                    Most Streamed Spotify Songs 2023              47KB  2023-08-26 11:04:57          10133        330  1.0              
carlmcbrideellis/zzzs-lightweight-training-dataset-target  Zzzs: Lightweight training dataset + target  185MB  2023-09-11 07:21:51            340         50  1.0              
muhammadtalhaawan/world-export-and-import-dataset          World Export & Import Dataset (1989 - 2023)  721KB  2023-09-09 18:59:41           1115         32  1.0              
josephinelsy/spotify-top-hit-playlist-2010-2022            Spotify Top Hit Playlist (2010-2022)         210KB  2023-09-0

In [None]:
!kaggle datasets download -d sobhanmoosavi/us-accidents

Downloading us-accidents.zip to /content
 99% 647M/653M [00:07<00:00, 133MB/s]
100% 653M/653M [00:08<00:00, 85.6MB/s]


In [None]:
!mkdir dataset

In [None]:
!unzip us-accidents.zip -d dataset

Archive:  us-accidents.zip
  inflating: dataset/US_Accidents_March23.csv  


## Installation and imports

In [None]:
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py

Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 390, done.
remote: Counting objects: 100% (121/121), done.
remote: Compressing objects: 100% (70/70), done.
remote: Total 390 (delta 89), reused 51 (delta 51), pack-reused 269
Receiving objects: 100% (390/390), 107.11 KiB | 2.19 MiB/s, done.
Resolving deltas: 100% (191/191), done.
Collecting pynvml
  Downloading pynvml-11.5.0-py3-none-any.whl (53 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.1/53.1 kB 1.4 MB/s eta 0:00:00
Installing collected packages: pynvml
Successfully installed pynvml-11.5.0
***********************************************************************
Woo! Your instance has the right kind of GPU, a Tesla T4!
We will now install RAPIDS cuDF, cuML, and cuGraph via pip! 
Please stand by, should be quick...
***********************************************************************

Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Collecting cudf-cu11
  Downloading https://py

In [None]:
import cudf
import cuml
import cupy as cp

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## 1) Loading the dataset

We will read the dataset from the .csv file using cuDF.

If you run out of memory while reading the file, one option is to limit the number of rows read with the `nrows` parameter of `read_csv`. Example: `cudf.read_csv('/content/dataset/US_Accidents_March23.csv', nrows=6_000_000)`

The GPU currently assigned to free Colab sessions (T4) has more than enough memory to read this entire dataset.
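
The `nrows` fallback can be wrapped in a small helper. This is a minimal sketch, assuming the reader raises Python's `MemoryError` when the GPU runs out of memory; that is an assumption about cuDF's behavior, not something this notebook demonstrates. The helper name `read_csv_with_fallback` and its parameters are hypothetical.

```python
def read_csv_with_fallback(reader, path, fallback_nrows=6_000_000):
    """Try a full read first; on MemoryError, retry with a row limit.

    `reader` is any callable with a read_csv-like signature,
    for example `cudf.read_csv` or `pandas.read_csv`.
    """
    try:
        return reader(path)
    except MemoryError:
        # Assumption: out-of-memory surfaces as MemoryError.
        return reader(path, nrows=fallback_nrows)


# Hypothetical usage in this notebook:
# base = read_csv_with_fallback(cudf.read_csv,
#                               '/content/dataset/US_Accidents_March23.csv')
```

Passing the reader as an argument keeps the helper independent of the library, so the same code works with Pandas on a machine without a GPU.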

In [None]:
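%%time
# `%%time` (cell magic) times the whole cell; a bare `%time` line times an empty statement (a few µs)
base = cudf.read_csv('/content/dataset/US_Accidents_March23.csv')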

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 7.15 µs
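
For a measurement that does not depend on IPython magics, wall-clock time can also be taken explicitly. The sketch below uses `time.perf_counter` from the standard library; the helper name `timed` is hypothetical.

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Hypothetical usage in this notebook:
# base, seconds = timed(cudf.read_csv,
#                       '/content/dataset/US_Accidents_March23.csv')
# print(f"read_csv took {seconds:.2f} s")
```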


In [None]:
base.shape

(7728394, 46)

## 2) Removing null values

In [None]:
df = base.dropna()

In [None]:
df

Unnamed: 0,ID,Source,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),...,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
3402762,A-3412645,Source1,3,2016-02-08 00:37:08,2016-02-08 06:37:08,40.108910,-83.092860,40.112060,-83.031870,3.230,...,False,False,False,False,False,False,Night,Night,Night,Night
3402767,A-3412650,Source1,3,2016-02-08 07:53:43,2016-02-08 13:53:43,39.172393,-84.492792,39.170476,-84.501798,0.500,...,False,False,False,False,False,False,Day,Day,Day,Day
3402771,A-3412654,Source1,2,2016-02-08 11:51:46,2016-02-08 17:51:46,41.375310,-81.820170,41.367860,-81.821740,0.521,...,False,False,False,False,False,False,Day,Day,Day,Day
3402773,A-3412656,Source1,2,2016-02-08 15:16:43,2016-02-08 21:16:43,40.109310,-82.968490,40.110780,-82.984000,0.826,...,False,False,False,False,False,False,Day,Day,Day,Day
3402774,A-3412657,Source1,2,2016-02-08 15:43:50,2016-02-08 21:43:50,39.192880,-84.477230,39.196150,-84.473350,0.307,...,False,False,False,False,False,False,Day,Day,Day,Day
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7728389,A-7777757,Source1,2,2019-08-23 18:03:25,2019-08-23 18:32:01,34.002480,-117.379360,33.998880,-117.370940,0.543,...,False,False,False,False,False,False,Day,Day,Day,Day
7728390,A-7777758,Source1,2,2019-08-23 19:11:30,2019-08-23 19:38:23,32.766960,-117.148060,32.765550,-117.153630,0.338,...,False,False,False,False,False,False,Day,Day,Day,Day
7728391,A-7777759,Source1,2,2019-08-23 19:00:21,2019-08-23 19:28:49,33.775450,-117.847790,33.777400,-117.857270,0.561,...,False,False,False,False,False,False,Day,Day,Day,Day
7728392,A-7777760,Source1,2,2019-08-23 19:00:21,2019-08-23 19:29:42,33.992460,-118.403020,33.983110,-118.395650,0.772,...,False,False,False,False,False,False,Day,Day,Day,Day


In [None]:
type(df)

cudf.core.dataframe.DataFrame

## 3) Removing unnecessary columns

In [None]:
cols_dataset = df.columns.values.tolist()
cols_dataset

['ID',
 'Source',
 'Severity',
 'Start_Time',
 'End_Time',
 'Start_Lat',
 'Start_Lng',
 'End_Lat',
 'End_Lng',
 'Distance(mi)',
 'Description',
 'Street',
 'City',
 'County',
 'State',
 'Zipcode',
 'Country',
 'Timezone',
 'Airport_Code',
 'Weather_Timestamp',
 'Temperature(F)',
 'Wind_Chill(F)',
 'Humidity(%)',
 'Pressure(in)',
 'Visibility(mi)',
 'Wind_Direction',
 'Wind_Speed(mph)',
 'Precipitation(in)',
 'Weather_Condition',
 'Amenity',
 'Bump',
 'Crossing',
 'Give_Way',
 'Junction',
 'No_Exit',
 'Railway',
 'Roundabout',
 'Station',
 'Stop',
 'Traffic_Calming',
 'Traffic_Signal',
 'Turning_Loop',
 'Sunrise_Sunset',
 'Civil_Twilight',
 'Nautical_Twilight',
 'Astronomical_Twilight']

In [None]:
cols = ['Severity', 'Source', 'County', 'State', 'Weather_Condition', 'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight', 'Amenity', 'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']
cols

['Severity',
 'Source',
 'County',
 'State',
 'Weather_Condition',
 'Sunrise_Sunset',
 'Civil_Twilight',
 'Nautical_Twilight',
 'Astronomical_Twilight',
 'Amenity',
 'Bump',
 'Crossing',
 'Give_Way',
 'Junction',
 'No_Exit',
 'Railway',
 'Roundabout',
 'Station',
 'Stop',
 'Traffic_Calming',
 'Traffic_Signal',
 'Turning_Loop']

In [None]:
len(cols)

22

In [None]:
drop_cols = [c for c in cols_dataset if c not in cols]
drop_cols

['ID',
 'Start_Time',
 'End_Time',
 'Start_Lat',
 'Start_Lng',
 'End_Lat',
 'End_Lng',
 'Distance(mi)',
 'Description',
 'Street',
 'City',
 'Zipcode',
 'Country',
 'Timezone',
 'Airport_Code',
 'Weather_Timestamp',
 'Temperature(F)',
 'Wind_Chill(F)',
 'Humidity(%)',
 'Pressure(in)',
 'Visibility(mi)',
 'Wind_Direction',
 'Wind_Speed(mph)',
 'Precipitation(in)']

In [None]:
df = df.drop(columns=drop_cols)

In [None]:
df

Unnamed: 0,Source,Severity,County,State,Weather_Condition,Amenity,Bump,Crossing,Give_Way,Junction,...,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
3402762,Source1,3,Franklin,OH,Light Rain,False,False,False,False,False,...,False,False,False,False,False,False,Night,Night,Night,Night
3402767,Source1,3,Hamilton,OH,Light Rain,False,False,False,False,False,...,False,False,False,False,False,False,Day,Day,Day,Day
3402771,Source1,2,Cuyahoga,OH,Snow,False,False,False,False,True,...,False,False,False,False,False,False,Day,Day,Day,Day
3402773,Source1,2,Franklin,OH,Snow,False,False,False,False,False,...,False,False,False,False,False,False,Day,Day,Day,Day
3402774,Source1,2,Hamilton,OH,Light Snow,False,False,False,False,False,...,False,False,False,False,False,False,Day,Day,Day,Day
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7728389,Source1,2,Riverside,CA,Fair,False,False,False,False,False,...,False,False,False,False,False,False,Day,Day,Day,Day
7728390,Source1,2,San Diego,CA,Fair,False,False,False,False,False,...,False,False,False,False,False,False,Day,Day,Day,Day
7728391,Source1,2,Orange,CA,Partly Cloudy,False,False,False,False,True,...,False,False,False,False,False,False,Day,Day,Day,Day
7728392,Source1,2,Los Angeles,CA,Fair,False,False,False,False,False,...,False,False,False,False,False,False,Day,Day,Day,Day


## 4) Label encoding



In [None]:
from cuml.preprocessing import LabelEncoder

In [None]:
label_encoder = LabelEncoder()



Instead of writing the encoding by hand for each of the 21 columns (Severity is left out), as in the snippet below, we will loop over the columns.

```
df['Weather_Condition'] = label_encoder.fit_transform(df['Weather_Condition'])
df['Amenity']= label_encoder.fit_transform(df['Amenity'])
[...]
```

In [None]:
for c in cols:
  if c != 'Severity':
    df[c] = label_encoder.fit_transform(df[c])
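
To make the transformation concrete, here is a pure-Python illustration of what label encoding does: each distinct value is replaced by its position in the sorted list of distinct values. This is a sketch of the idea, not cuML's implementation, and the function name `label_encode` is hypothetical. The sorted ordering is consistent with the encoded table in this notebook, where `Day` maps to 0 and `Night` to 1.

```python
def label_encode(values):
    """Map each distinct value to an integer code (sorted order)."""
    classes = sorted(set(values))
    code_of = {value: code for code, value in enumerate(classes)}
    return [code_of[value] for value in values], classes


codes, classes = label_encode(["Night", "Day", "Day", "Night"])
print(classes)  # ['Day', 'Night']
print(codes)    # [1, 0, 0, 1]
```

Because the codes are refit for every column, the same integer means different things in different columns, which is why one encoder object can be reused inside the loop.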

In [None]:
df

Unnamed: 0,Source,Severity,County,State,Weather_Condition,Amenity,Bump,Crossing,Give_Way,Junction,...,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
3402762,0,3,532,33,53,0,0,0,0,0,...,0,0,0,0,0,0,1,1,1,1
3402767,0,3,638,33,53,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3402771,0,2,388,33,97,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3402773,0,2,532,33,97,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3402774,0,2,638,33,61,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7728389,0,2,1258,3,14,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7728390,0,2,1310,3,14,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7728391,0,2,1098,3,76,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
7728392,0,2,870,3,14,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## 5) Balancing the dataset


We can check how the values are distributed across the categories using `value_counts()`.

In [None]:
df['Severity'].value_counts()

2    3348445
4     112511
3      68026
1      25567
Name: Severity, dtype: int32

In [None]:
df = df.to_pandas()

In [None]:
type(df)

pandas.core.frame.DataFrame

In [None]:
from sklearn.utils import resample

In [None]:
df_s1 = df[df['Severity'] == 1]
df_s2 = df[df['Severity'] == 2]
df_s3 = df[df['Severity'] == 3]
df_s4 = df[df['Severity'] == 4]

In [None]:
df_s1

Unnamed: 0,Source,Severity,County,State,Weather_Condition,Amenity,Bump,Crossing,Give_Way,Junction,...,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
6525681,0,1,1163,2,71,0,0,1,1,0,...,0,0,1,0,0,0,0,0,0,0
6527285,0,1,1163,2,14,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
6540036,0,1,868,33,58,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6542183,0,1,912,2,14,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
6543739,0,1,1384,3,14,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7228848,0,1,1571,25,7,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7228849,0,1,200,25,76,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7228851,0,1,209,25,71,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7228852,0,1,1571,25,71,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Target rows per class: largest class size divided by 200 (3348445 // 200 = 16742)
count = max(len(df_s1), len(df_s2), len(df_s3), len(df_s4)) // 200

In [None]:
count

16742

In [None]:
# Draw `count` rows per class; sample with replacement only when a class has fewer than `count` rows
df_s1 = resample(df_s1, replace=len(df_s1) < count, n_samples=count, random_state=42)
df_s2 = resample(df_s2, replace=len(df_s2) < count, n_samples=count, random_state=42)
df_s3 = resample(df_s3, replace=len(df_s3) < count, n_samples=count, random_state=42)
df_s4 = resample(df_s4, replace=len(df_s4) < count, n_samples=count, random_state=42)
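
The balancing rule in the cell above can be illustrated without any libraries: every class is resampled to the same size, drawing without replacement when the class is large enough and with replacement otherwise. This is a simplified sketch using the standard `random` module rather than `sklearn.utils.resample`; the names `balance` and `label_of` are hypothetical.

```python
import random
from collections import Counter


def balance(rows, label_of, n_samples, seed=42):
    """Resample every class to exactly n_samples rows."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(label_of(row), []).append(row)

    balanced = []
    for label in sorted(groups):
        members = groups[label]
        if len(members) >= n_samples:
            # Enough rows: undersample without replacement.
            picked = rng.sample(members, n_samples)
        else:
            # Too few rows: oversample with replacement.
            picked = rng.choices(members, k=n_samples)
        balanced.extend(picked)
    return balanced


# Toy data: 100 rows of class 2 and 5 rows of class 1.
toy_rows = [(i, 2) for i in range(100)] + [(i, 1) for i in range(5)]
toy_balanced = balance(toy_rows, label_of=lambda row: row[1], n_samples=10)
print(Counter(label for _, label in toy_balanced))
```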

In [None]:
df = pd.concat([df_s1, df_s2, df_s3, df_s4])

In [None]:
print(df['Severity'].value_counts())

#df.groupby(by='Severity')['Severity'].count()

1    16742
2    16742
3    16742
4    16742
Name: Severity, dtype: int64


## 6) One-Hot Encoding

In [None]:
X = df.drop('Severity', axis=1)

In [None]:
X

Unnamed: 0,Source,County,State,Weather_Condition,Amenity,Bump,Crossing,Give_Way,Junction,No_Exit,...,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
7151782,0,912,2,14,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0
6929811,0,25,36,76,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7154275,0,11,3,14,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7081917,0,459,25,7,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7144570,0,46,4,76,0,0,1,0,0,0,...,0,1,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4360711,0,1212,4,114,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5224143,0,248,25,14,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3928245,0,65,29,14,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6778660,0,1104,32,7,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,1,1


In [None]:
from cuml.preprocessing import OneHotEncoder
from cuml.compose import ColumnTransformer

In [None]:
idx = [*range(0,20)]
idx

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

In [None]:
onehotencoder = ColumnTransformer(transformers=[('OneHot', OneHotEncoder(sparse=False), idx)], remainder='passthrough')

In [None]:
X = onehotencoder.fit_transform(X)

In [None]:
X.shape

(66968, 1466)
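
The growth from 21 predictor columns to 1466 follows from how one-hot encoding works: each encoded column is replaced by one 0/1 column per distinct value it contains, and the passthrough column is kept unchanged. The pure-Python sketch below shows the idea for a single column. It is an illustration, not the cuML implementation, and `one_hot` is a hypothetical name.

```python
def one_hot(column):
    """Return (rows, categories): one 0/1 indicator per distinct value."""
    categories = sorted(set(column))
    position = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in column:
        row = [0] * len(categories)
        row[position[value]] = 1
        rows.append(row)
    return rows, categories


encoded, cats = one_hot([2, 0, 2])
print(cats)     # [0, 2]
print(encoded)  # [[0, 1], [1, 0], [0, 1]]
```

High-cardinality columns such as `County` therefore account for most of the new columns, which is the usual argument for trying the ordinal encoding suggested in the challenge steps.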

## 7) Scaling the values

In [None]:
from cuml.preprocessing import StandardScaler
scaler_dataset = StandardScaler()

X = scaler_dataset.fit_transform(X)
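
Standard scaling replaces each value `x` in a column with `(x - mean) / std`, so every column ends up with mean 0 and unit variance. Below is a minimal pure-Python sketch for one column. It uses the population standard deviation and leaves constant columns at 0.0, which mirrors scikit-learn's `StandardScaler`; that cuML behaves identically is an assumption here. The function name `standardize` is hypothetical.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Scale one column to zero mean and unit (population) variance."""
    mean = fmean(column)
    std = pstdev(column)
    # Assumption: a constant column is only centered, so it becomes all zeros.
    scale = std if std > 0 else 1.0
    return [(x - mean) / scale for x in column]


print(standardize([0, 2]))     # [-1.0, 1.0]
print(standardize([1, 1, 1]))  # [0.0, 0.0, 0.0]
```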

In [None]:
print(X)
print(X.shape)

       0         1         2         3        4         5         6     \
0       0.0 -0.008641 -0.007729 -0.018128 -0.02117 -0.008641 -0.079253   
1       0.0 -0.008641 -0.007729 -0.018128 -0.02117 -0.008641 -0.079253   
2       0.0 -0.008641 -0.007729 -0.018128 -0.02117 -0.008641 -0.079253   
3       0.0 -0.008641 -0.007729 -0.018128 -0.02117 -0.008641 -0.079253   
4       0.0 -0.008641 -0.007729 -0.018128 -0.02117 -0.008641 -0.079253   
...     ...       ...       ...       ...      ...       ...       ...   
66963   0.0 -0.008641 -0.007729 -0.018128 -0.02117 -0.008641 -0.079253   
66964   0.0 -0.008641 -0.007729 -0.018128 -0.02117 -0.008641 -0.079253   
66965   0.0 -0.008641 -0.007729 -0.018128 -0.02117 -0.008641 -0.079253   
66966   0.0 -0.008641 -0.007729 -0.018128 -0.02117 -0.008641 -0.079253   
66967   0.0 -0.008641 -0.007729 -0.018128 -0.02117 -0.008641 -0.079253   

           7         8         9     ...      1456      1457  1458      1459  \
0     -0.020814 -0.007729 -0.01

## 8) Splitting predictors and class

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
y = df['Severity']

y = LabelEncoder().fit_transform(y)

In [None]:
y

7151782    0
6929811    0
7154275    0
7081917    0
7144570    0
          ..
4360711    3
5224143    3
3928245    3
6778660    3
6102421    3
Length: 66968, dtype: uint8

In [None]:
y.shape

(66968,)

In [None]:
X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X, y, test_size = 0.15, random_state = 42)

In [None]:
X_treinamento.shape, X_teste.shape, y_treinamento.shape, y_teste.shape

((56922, 1466), (10046, 1466), (56922,), (10046,))
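
The split sizes can be checked by hand. With `test_size = 0.15` and 66968 rows, scikit-learn rounds the test share up: 66968 × 0.15 = 10045.2, which becomes 10046 test rows, leaving 56922 for training. The helper `split_sizes` below is a hypothetical name for this arithmetic.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # test share is rounded up
    n_train = n_samples - n_test
    return n_train, n_test


print(split_sizes(66968, 0.15))  # (56922, 10046)
```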

In [None]:
import pickle
with open('dataset.pkl', mode = 'wb') as f:
  pickle.dump([X_treinamento, y_treinamento, X_teste, y_teste], f)

## 9) Training with the algorithm

In [None]:
from cuml.neighbors import KNeighborsClassifier

In [None]:
with open('dataset.pkl', 'rb') as f:
  X_treinamento, y_treinamento, X_teste, y_teste = pickle.load(f)

In [None]:
X_treinamento_cudf = cudf.DataFrame.from_pandas(X_treinamento)
X_teste_cudf = cudf.DataFrame.from_pandas(X_teste)

y_treinamento_cudf = cudf.Series(y_treinamento.values)
y_teste_cudf = cudf.Series(y_teste.values)

In [None]:
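%%time
# `%%time` (cell magic) times the whole cell; a bare `%time` line times an empty statement (a few µs)
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_treinamento, y_treinamento)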

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 7.87 µs


KNeighborsClassifier()
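
For intuition, k-nearest-neighbors classification finds the `k` training points closest to a query and returns the most common label among them. The pure-Python sketch below assumes Euclidean distance and equal-weight voting. It illustrates the idea only and does not reflect how cuML implements the search on the GPU; `knn_predict` is a hypothetical name.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest points."""
    # Pair each training point's distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


toy_X = [(0, 0), (0, 1), (5, 5), (6, 5)]
toy_y = [0, 0, 1, 1]
print(knn_predict(toy_X, toy_y, (0.2, 0.1), k=3))  # 0
print(knn_predict(toy_X, toy_y, (5.5, 5.0), k=3))  # 1
```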

## 10) Prediction and accuracy calculation

In [None]:
previsoes = knn.predict(X_teste)
previsoes

61798    3
10741    0
11048    3
63662    3
26151    1
        ..
27849    2
37385    3
41310    2
29818    3
25400    0
Length: 10046, dtype: uint8

In [None]:
from cuml.metrics import accuracy_score

In [None]:
accuracy_score(y_teste, previsoes)

0.6036233305931091
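
Accuracy is the fraction of test rows whose predicted class equals the true class. With four classes, uniform random guessing would be right about 25% of the time, so a score near 0.60 is well above chance. A minimal sketch of the metric, with the hypothetical name `accuracy`:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and truth agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


print(accuracy([0, 1, 2, 3], [0, 1, 2, 0]))  # 0.75
```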

In [None]:
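import locale
# Make Python report UTF-8 as the preferred encoding (a common Colab workaround when `!` shell commands fail with a locale error)
locale.getpreferredencoding = lambda: "UTF-8"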

In [None]:
!nvidia-smi

Mon Sep 18 14:13:42 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P0    25W /  70W |   7579MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces