# 1.0 An end-to-end classification problem (ETL)



## 1.1 Dataset description



We'll be looking at individual income in the United States. The **data** comes from the **1994 census** and contains information on each individual's **marital status**, **age**, **type of work**, and more. The **target column**, that is, what we want to predict, is whether an individual earns at most 50k a year or more than **50k a year**.

You can download the data from the [University of California, Irvine's website](http://archive.ics.uci.edu/ml/datasets/Adult).

Let's take the following steps:

1. Load Libraries
2. Fetch Data, including EDA
3. Pre-processing
4. Data Segregation

<center><img width="600" src="https://drive.google.com/uc?export=view&id=1a-nyAPNPiVh-Xb2Pu2t2p-BhSvHJS0pO"></center>

## 1.2 Install and load libraries

In [1]:
!pip install wandb

Successfully installed GitPython-3.1.43 docker-pycreds-0.4.0 gitdb-4.0.11 sentry-sdk-1.45.0 setproctitle-1.3.3 smmap-5.0.1 wandb-0.16.6


In [2]:
import wandb
import pandas as pd

## 1.3 Preprocessing

### 1.3.1 Log in to wandb


In [None]:
# Login to Weights & Biases
!wandb login --relogin

[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [5]:
# Non-interactive login: replace YOUR_API_KEY with your own key and keep real keys out of shared notebooks
!wandb login YOUR_API_KEY

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


### 1.3.2 Artifacts

In [6]:
input_artifact="decision_tree/raw_data.csv:latest"
artifact_name="preprocessed_data.csv"
artifact_type="clean_data"
artifact_description="Data after preprocessing"
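
The value of `input_artifact` follows the `project/name:alias` pattern that W&B uses to reference artifacts (an optional `entity/` prefix may come first). As a minimal sketch of how that string breaks down, the helper below splits a reference into its parts. `split_artifact_ref` is a hypothetical function written for illustration only; it is not part of the wandb library, and `run.use_artifact` does not require it.

```python
def split_artifact_ref(ref: str) -> dict:
    """Split '[entity/]project/name:alias' into its parts (hypothetical helper)."""
    # Everything after the first ':' is the alias or version, e.g. 'latest' or 'v0'.
    path, _, alias = ref.partition(":")
    parts = path.split("/")

    if len(parts) == 3:
        entity, project, name = parts
    elif len(parts) == 2:
        entity, (project, name) = None, parts
    elif len(parts) == 1:
        entity, project, name = None, None, parts[0]
    else:
        raise ValueError(f"Unexpected artifact reference: {ref!r}")

    return {
        "entity": entity,
        "project": project,
        "name": name,
        "alias": alias or None,  # None when no alias was given
    }


parts = split_artifact_ref("decision_tree/raw_data.csv:latest")
print(parts)
```

Here `decision_tree` is the project, `raw_data.csv` is the artifact name, and `latest` is the alias that points at the most recently logged version.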

### 1.3.3 Set up your wandb project and clean the dataset

After the fetch step, the raw data artifact was generated.
Now we need to preprocess the raw data to create a new artifact (clean_data).

In [7]:
# create a new job_type
run = wandb.init(project="decision_tree", job_type="process_data")

[34m[1mwandb[0m: Currently logged in as: [33mthiagotheiry05[0m ([33mteam-thiago[0m). Use [1m`wandb login --relogin`[0m to force relogin


In [8]:
# download the latest version of artifact raw_data.csv
artifact = run.use_artifact(input_artifact)

# create a dataframe from the artifact
df = pd.read_csv(artifact.file())

In [9]:
# Delete duplicated rows
df.drop_duplicates(inplace=True)

# Generate a "clean data file"
df.to_csv(artifact_name,index=False)
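
To see what the deduplication step does, here is a minimal sketch on a toy DataFrame. The rows and column names are made up for illustration and are not taken from the census data. It counts exact duplicate rows with `duplicated()` and compares the row count before and after `drop_duplicates()`.

```python
import pandas as pd

# Toy rows invented for this example; not real census records.
toy = pd.DataFrame(
    {
        "age": [39, 50, 39, 28],
        "workclass": ["State-gov", "Private", "State-gov", "Private"],
        "high_income": [False, False, False, True],
    }
)

n_before = len(toy)
# duplicated() flags every row that is identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates()
n_after = len(deduped)

print(f"rows before: {n_before}, duplicates: {n_duplicates}, rows after: {n_after}")
```

Note that the surviving rows keep their original index labels, so the index may have gaps after deduplication. Writing the file with `index=False`, as the cell above does, keeps those labels out of the CSV.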

In [10]:
# Create a new artifact and configure with the necessary arguments
artifact = wandb.Artifact(name=artifact_name,
                          type=artifact_type,
                          description=artifact_description)
artifact.add_file(artifact_name)

ArtifactManifestEntry(path='preprocessed_data.csv', digest='J1V627eFf/f14pxG4ZOYzg==', size=3808822, local_path='/root/.local/share/wandb/artifacts/staging/tmpy015aw0y', skip_cache=False)

In [11]:
# Upload the artifact to Wandb
run.log_artifact(artifact)

<Artifact preprocessed_data.csv>

In [12]:
# close the run
# wait a moment after running the previous cell before executing this one
run.finish()

VBox(children=(Label(value='3.642 MB of 3.642 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))