# DAT210x - Programming with Python for DS

## Module2 - Lab5

Import and alias Pandas:

In [1]:
import pandas as pd 

As per usual, load up the specified dataset, setting appropriate header labels.

In [3]:
df = pd.read_csv("Datasets/census.data",names=['education', 'age', 'capital-gain', 'race', 'capital-loss', 'hours-per-week', 'sex', 'classification'])

df.head()

Unnamed: 0,education,age,capital-gain,race,capital-loss,hours-per-week,sex,classification
0,Bachelors,39,2174,White,0,40,Male,<=50K
1,Bachelors,50,?,White,0,13,Male,<=50K
2,HS-grad,38,?,White,0,40,Male,<=50K
3,11th,53,?,Black,0,40,Male,<=50K
4,Bachelors,28,0,Black,0,40,Female,<=50K


In [4]:
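import numpy as np
# Note: this masks every cell equal to the integer 0 (e.g. the zeros in capital-loss).
# It does not touch the '?' placeholders or the string '0' values in capital-gain.
df[df==0] = np.nan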

print(df.head(20))

       education  age capital-gain                race  capital-loss  \
0      Bachelors   39         2174               White           NaN   
1      Bachelors   50            ?               White           NaN   
2        HS-grad   38            ?               White           NaN   
3           11th   53            ?               Black           NaN   
4      Bachelors   28            0               Black           NaN   
5        Masters   37            0               White           NaN   
6            9th   49            0               Black           NaN   
7        HS-grad   52            0               White           NaN   
8        Masters   31        14084               White           NaN   
9      Bachelors   42         5178               White           NaN   
10  Some-college   37            0               Black           NaN   
11     Bachelors   30            0  Asian-Pac-Islander           NaN   
12     Bachelors   23            0               White          

Excellent.

Now, use basic pandas commands to look through the dataset. Get a feel for it before proceeding!

Do the data types of each column reflect the values you see when you look through the data in a text editor or spreadsheet program? If you see `object` where you expect `int64` or `float64`, that is a good indicator that the column contains a string, a missing value, or an erroneous value.

In [5]:
df.dtypes

education          object
age                 int64
capital-gain       object
race               object
capital-loss      float64
hours-per-week      int64
sex                object
classification     object
dtype: object
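One way to locate the rogue entries behind an `object` dtype is to attempt a numeric conversion and see which values fail to parse. The sketch below runs on a small made-up frame rather than the census file; the values in `toy` are illustrative assumptions, and `pd.to_numeric` with `errors="coerce"` is just one possible approach.

```python
import pandas as pd

# Toy stand-in for a column that should be numeric but holds a placeholder.
toy = pd.DataFrame({"capital-gain": ["2174", "?", "0", "14084"]})

# Values that cannot be parsed become NaN instead of raising an error.
converted = pd.to_numeric(toy["capital-gain"], errors="coerce")

# Entries that were present originally but failed to parse are the rogue values.
failed = converted.isna() & toy["capital-gain"].notna()
rogue = toy.loc[failed, "capital-gain"]

print(rogue.unique())
print(converted.dtype)
```

Any value listed in `rogue` is a candidate for NaN encoding.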

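Try using `your_data_frame['your_column'].unique()` or, equivalently, `your_data_frame.your_column.unique()` to see the unique values of each column and identify the rogue values. Attribute access only works when the column name is a valid Python identifier, so a name such as `capital-gain` needs the bracket form.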

If you find any values that should be encoded as NaN, you can convert them either with the `na_values` parameter when loading the dataframe or with one of the other methods discussed in the reading.
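As a minimal sketch of the `na_values` route, the example below parses a few in-memory rows instead of `Datasets/census.data`. The rows, the shortened column list, and the `row` index name are assumptions made for illustration.

```python
import io
import pandas as pd

# Small in-memory stand-in for the census file; the rows are illustrative only.
raw = io.StringIO(
    "0,Bachelors,39,2174\n"
    "1,Bachelors,50,?\n"
    "2,HS-grad,38,?\n"
)

# na_values="?" maps the placeholder to NaN while the file is parsed,
# so the affected column is read as a numeric (float) column.
toy = pd.read_csv(
    raw,
    names=["row", "education", "age", "capital-gain"],
    index_col="row",
    na_values="?",
)

print(toy.dtypes)
print(toy["capital-gain"].isna().sum())
```

Because the placeholder is handled at load time, no separate clean-up pass is needed for that column.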

In [6]:
print(df.education.unique())

['Bachelors' 'HS-grad' '11th' 'Masters' '9th' 'Some-college' '7th-8th'
 'Doctorate' '5th-6th' '10th' '1st-4th' 'Preschool' '12th']


In [7]:
edu_order =['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th', 'HS-grad', 'Some-college', 'Bachelors', 'Masters', 'Doctorate']

In [8]:
df.education = df.education.astype(pd.CategoricalDtype(categories=edu_order, ordered=True)).cat.codes


In [9]:
df.head(10)

Unnamed: 0,education,age,capital-gain,race,capital-loss,hours-per-week,sex,classification
0,10,39,2174,White,,40,Male,<=50K
1,10,50,?,White,,13,Male,<=50K
2,8,38,?,White,,40,Male,<=50K
3,6,53,?,Black,,40,Male,<=50K
4,10,28,0,Black,,40,Female,<=50K
5,11,37,0,White,,40,Female,<=50K
6,4,49,0,Black,,16,Female,<=50K
7,8,52,0,White,,45,Male,>50K
8,11,31,14084,White,,50,Female,>50K
9,10,42,5178,White,,40,Male,>50K


Look through your data and identify any potential categorical features. Ensure you properly encode any ordinal and nominal types using the methods discussed in the chapter.

Be careful! Some features can be represented as either categorical or continuous (numerical). If you ever get confused, ask yourself which makes more sense in general: representing such a feature with a continuous numeric type, or with a series of categories?
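The difference between the two encodings can be sketched on a toy frame. The data and the shortened `edu_levels` ordering below are illustrative assumptions, not the lab's full category list.

```python
import pandas as pd

toy = pd.DataFrame({
    "education": ["HS-grad", "Bachelors", "9th"],
    "race": ["White", "Black", "White"],
})

# Ordinal: the order of the levels carries meaning, so map each level to its rank.
edu_levels = ["9th", "HS-grad", "Bachelors"]  # shortened, illustrative ordering
edu_dtype = pd.CategoricalDtype(categories=edu_levels, ordered=True)
toy["education"] = toy["education"].astype(edu_dtype).cat.codes

# Nominal: no inherent order, so expand into one indicator column per level.
# dtype=int requests 0/1 integers; recent pandas versions default to booleans.
toy = pd.get_dummies(toy, columns=["race"], dtype=int)

print(toy)
```

A value missing from the category list would receive the code `-1`, which is worth checking for after an ordinal encoding.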

In [10]:
df = pd.get_dummies(df, columns=['race'])

In [11]:
df.head()

Unnamed: 0,education,age,capital-gain,capital-loss,hours-per-week,sex,classification,race_Amer-Indian-Eskimo,race_Asian-Pac-Islander,race_Black,race_Other,race_White
0,10,39,2174,,40,Male,<=50K,0,0,0,0,1
1,10,50,?,,13,Male,<=50K,0,0,0,0,1
2,8,38,?,,40,Male,<=50K,0,0,0,0,1
3,6,53,?,,40,Male,<=50K,0,0,1,0,0
4,10,28,0,,40,Female,<=50K,0,0,1,0,0


In [12]:
print(df.sex.unique())

['Male' 'Female']


In [13]:
df = pd.get_dummies(df, columns=['sex'])

In [14]:
df.head()

Unnamed: 0,education,age,capital-gain,capital-loss,hours-per-week,classification,race_Amer-Indian-Eskimo,race_Asian-Pac-Islander,race_Black,race_Other,race_White,sex_Female,sex_Male
0,10,39,2174,,40,<=50K,0,0,0,0,1,0,1
1,10,50,?,,13,<=50K,0,0,0,0,1,0,1
2,8,38,?,,40,<=50K,0,0,0,0,1,0,1
3,6,53,?,,40,<=50K,0,0,1,0,0,0,1
4,10,28,0,,40,<=50K,0,0,1,0,0,1,0


In [15]:
print(df.classification.unique())

['<=50K' '>50K']


In [16]:
df = pd.get_dummies(df, columns=['classification'])

In [17]:
df.head()

Unnamed: 0,education,age,capital-gain,capital-loss,hours-per-week,race_Amer-Indian-Eskimo,race_Asian-Pac-Islander,race_Black,race_Other,race_White,sex_Female,sex_Male,classification_<=50K,classification_>50K
0,10,39,2174,,40,0,0,0,0,1,0,1,1,0
1,10,50,?,,13,0,0,0,0,1,0,1,1,0
2,8,38,?,,40,0,0,0,0,1,0,1,1,0
3,6,53,?,,40,0,0,1,0,0,0,1,1,0
4,10,28,0,,40,0,0,1,0,0,1,0,1,0


Lastly, print out your dataframe!

In [18]:
print(df)
print("The number of columns now in the dataframe:", len(df.columns))

       education  age capital-gain  capital-loss  hours-per-week  \
0             10   39         2174           NaN              40   
1             10   50            ?           NaN              13   
2              8   38            ?           NaN              40   
3              6   53            ?           NaN              40   
4             10   28            0           NaN              40   
5             11   37            0           NaN              40   
6              4   49            0           NaN              16   
7              8   52            0           NaN              45   
8             11   31        14084           NaN              50   
9             10   42         5178           NaN              40   
10             9   37            0           NaN              80   
11            10   30            0           NaN              40   
12            10   23            0           NaN              30   
13             3   34            0           NaN

In [19]:
# 1 column (education) is a categorical ordinal variable
# 9 boolean columns were created in total (5 for race, 2 for sex, 2 for classification)
# The dataset is now 14 columns wide
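The column tally in the comments above can be cross-checked programmatically. This is a rough sketch on a made-up frame; `toy` and the `nominal` list are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [39, 50, 28],
    "race": ["White", "Black", "White"],
    "sex": ["Male", "Male", "Female"],
})

nominal = ["race", "sex"]

# Each nominal column is replaced by one indicator column per distinct value.
expected = (
    len(toy.columns)
    - len(nominal)
    + sum(toy[col].nunique() for col in nominal)
)

encoded = pd.get_dummies(toy, columns=nominal)
print(len(encoded.columns), expected)
```

Applied to the lab's frame, the same arithmetic gives 5 remaining columns plus 9 indicator columns, for 14 in total.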