# Cleaning US Census Data Portfolio

Census data is collected at regular intervals using methodologies such as total counts, sample surveys, and administrative records. After it is collected or generated, census data is summarized into counts and estimates for groups of people in different geographic areas.

In [97]:
import glob
import pandas as pd
from sklearn.impute import KNNImputer

In [98]:
files = glob.glob("states*.csv")

df_list = []
for filename in files:
    data = pd.read_csv(filename)
    df_list.append(data)

df = pd.concat(df_list, ignore_index=True)

df.head(10)

Unnamed: 0.1,Unnamed: 0,State,TotalPop,Hispanic,White,Black,Native,Asian,Pacific,Income,GenderPop
0,0,Rhode Island,1053661,13.36%,74.33%,5.68%,0.35%,3.25%,0.04%,"$59,125.27",510388M_543273F
1,1,South Carolina,4777576,5.06%,62.89%,28.75%,0.29%,1.25%,0.05%,"$46,296.81",2322409M_2455167F
2,2,South Dakota,843190,3.24%,82.50%,1.42%,9.42%,1.02%,0.04%,"$51,805.41",423477M_419713F
3,3,Tennessee,6499615,4.72%,73.49%,18.28%,0.23%,1.41%,0.04%,"$47,328.08",3167756M_3331859F
4,4,Texas,26538614,38.05%,44.69%,11.65%,0.26%,3.67%,0.07%,"$55,874.52",13171316M_13367298F
5,5,Utah,2903379,13.47%,79.41%,1.02%,1.08%,2.20%,0.83%,"$63,488.92",1459229M_1444150F
6,0,Utah,2903379,13.47%,79.41%,1.02%,1.08%,2.20%,0.83%,"$63,488.92",1459229M_1444150F
7,1,Vermont,626604,1.61%,93.98%,0.98%,0.30%,1.24%,0.03%,"$55,602.97",308573M_318031F
8,2,Virginia,8256630,8.01%,63.27%,20.18%,0.21%,5.46%,0.06%,"$72,866.01",4060948M_4195682F
9,3,Washington,6985464,11.14%,72.04%,3.38%,1.41%,7.02%,0.61%,"$64,493.77",3487725M_3497739F


The "Unnamed: 0" column holds the row indexes carried over from each part of the dataset. It carries no useful information, so we remove it.

In [99]:
df.drop(columns="Unnamed: 0", inplace=True)
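As an alternative sketch, and assuming the first column of each CSV part is simply a saved row index, that column can be consumed at load time with `index_col=0` so it never appears as "Unnamed: 0". The helper name `load_csv_parts` and the temporary files below are hypothetical and exist only for illustration.

```python
import glob
import os
import tempfile

import pandas as pd


def load_csv_parts(pattern):
    """Read every CSV matching `pattern`, treating the first column as the saved index."""
    frames = [pd.read_csv(path, index_col=0) for path in sorted(glob.glob(pattern))]
    # ignore_index=True discards the per-file indexes and builds one fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)


# Hypothetical two-part dataset written to a temporary directory for the demo
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"State": ["Utah"], "TotalPop": [2903379]}).to_csv(
    os.path.join(tmp_dir, "states0.csv")
)
pd.DataFrame({"State": ["Vermont"], "TotalPop": [626604]}).to_csv(
    os.path.join(tmp_dir, "states1.csv")
)

combined = load_csv_parts(os.path.join(tmp_dir, "states*.csv"))
print(combined.columns.tolist())
```

Sorting the matched paths keeps the row order reproducible, since `glob.glob` does not guarantee any particular order.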

In [100]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   State      60 non-null     object
 1   TotalPop   60 non-null     int64 
 2   Hispanic   60 non-null     object
 3   White      60 non-null     object
 4   Black      60 non-null     object
 5   Native     60 non-null     object
 6   Asian      60 non-null     object
 7   Pacific    55 non-null     object
 8   Income     60 non-null     object
 9   GenderPop  60 non-null     object
dtypes: int64(1), object(9)
memory usage: 4.8+ KB


At first glance we can see that there are 5 missing values in the "Pacific" column. We will deal with them later; for now, let's check whether the dataset contains any duplicates.

In [101]:
df.duplicated().sum()

9

Let's remove these 9 duplicate rows.

In [102]:
df = df.drop_duplicates()
df.head(10)

Unnamed: 0,State,TotalPop,Hispanic,White,Black,Native,Asian,Pacific,Income,GenderPop
0,Rhode Island,1053661,13.36%,74.33%,5.68%,0.35%,3.25%,0.04%,"$59,125.27",510388M_543273F
1,South Carolina,4777576,5.06%,62.89%,28.75%,0.29%,1.25%,0.05%,"$46,296.81",2322409M_2455167F
2,South Dakota,843190,3.24%,82.50%,1.42%,9.42%,1.02%,0.04%,"$51,805.41",423477M_419713F
3,Tennessee,6499615,4.72%,73.49%,18.28%,0.23%,1.41%,0.04%,"$47,328.08",3167756M_3331859F
4,Texas,26538614,38.05%,44.69%,11.65%,0.26%,3.67%,0.07%,"$55,874.52",13171316M_13367298F
5,Utah,2903379,13.47%,79.41%,1.02%,1.08%,2.20%,0.83%,"$63,488.92",1459229M_1444150F
7,Vermont,626604,1.61%,93.98%,0.98%,0.30%,1.24%,0.03%,"$55,602.97",308573M_318031F
8,Virginia,8256630,8.01%,63.27%,20.18%,0.21%,5.46%,0.06%,"$72,866.01",4060948M_4195682F
9,Washington,6985464,11.14%,72.04%,3.38%,1.41%,7.02%,0.61%,"$64,493.77",3487725M_3497739F
10,West Virginia,1851420,1.29%,92.18%,3.66%,0.15%,0.68%,0.03%,"$41,437.11",913631M_937789F
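The duplicate check and removal can be illustrated on a small toy frame (the rows are made up for the demo). Note that `drop_duplicates` keeps the original index labels, which is why label 6 is missing from the table above; `reset_index(drop=True)` is an optional extra step, not something the notebook does.

```python
import pandas as pd

# Toy data: the second row repeats the first one exactly
toy = pd.DataFrame(
    {"State": ["Utah", "Utah", "Vermont"], "TotalPop": [2903379, 2903379, 626604]}
)

# duplicated() marks every repeat of an earlier row as True
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence; resetting the index closes the gaps
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes)
print(deduped)
```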


The next problem is the dollar sign (together with the thousands separator) in every record of the "Income" column. Because of these characters, pandas reads the column as "object" even though it holds numerical data. Besides removing them, we need to convert the column type explicitly.

In [103]:
df.Income = df.Income.replace(r'[\$,]', '', regex=True)
df.Income = pd.to_numeric(df.Income)
df.head(10)

Unnamed: 0,State,TotalPop,Hispanic,White,Black,Native,Asian,Pacific,Income,GenderPop
0,Rhode Island,1053661,13.36%,74.33%,5.68%,0.35%,3.25%,0.04%,59125.27,510388M_543273F
1,South Carolina,4777576,5.06%,62.89%,28.75%,0.29%,1.25%,0.05%,46296.81,2322409M_2455167F
2,South Dakota,843190,3.24%,82.50%,1.42%,9.42%,1.02%,0.04%,51805.41,423477M_419713F
3,Tennessee,6499615,4.72%,73.49%,18.28%,0.23%,1.41%,0.04%,47328.08,3167756M_3331859F
4,Texas,26538614,38.05%,44.69%,11.65%,0.26%,3.67%,0.07%,55874.52,13171316M_13367298F
5,Utah,2903379,13.47%,79.41%,1.02%,1.08%,2.20%,0.83%,63488.92,1459229M_1444150F
7,Vermont,626604,1.61%,93.98%,0.98%,0.30%,1.24%,0.03%,55602.97,308573M_318031F
8,Virginia,8256630,8.01%,63.27%,20.18%,0.21%,5.46%,0.06%,72866.01,4060948M_4195682F
9,Washington,6985464,11.14%,72.04%,3.38%,1.41%,7.02%,0.61%,64493.77,3487725M_3497739F
10,West Virginia,1851420,1.29%,92.18%,3.66%,0.15%,0.68%,0.03%,41437.11,913631M_937789F
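A slightly more defensive variant of the currency clean-up is sketched below. The function name `parse_currency` and the `"n/a"` entry are hypothetical; the idea is that `errors="coerce"` turns anything unparseable into NaN instead of raising, so bad records surface later in the NaN count.

```python
import pandas as pd


def parse_currency(series):
    """Strip dollar signs and thousands separators, then convert to float."""
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    # errors="coerce" maps values that still are not numbers to NaN
    return pd.to_numeric(cleaned, errors="coerce")


income = pd.Series(["$59,125.27", "$46,296.81", "n/a"])
parsed = parse_currency(income)
print(parsed)
```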


The "GenderPop" column contains the male and female population of each state. Let's extract these values into separate columns to make them usable.

In [104]:
string_split = df.GenderPop.str.split("_")

df['Man'] = string_split.str.get(0)

df['Woman'] = string_split.str.get(1)

df.Man = df.Man.str.replace("M", "")

df.Woman = df.Woman.str.replace("F", "")

df.drop(columns="GenderPop", inplace=True)

df.Man = pd.to_numeric(df.Man)
df.Woman = pd.to_numeric(df.Woman)

Splitting the column and creating new ones may have introduced new NaNs. The following command counts the NaNs in every column.

In [105]:
df.isna().sum().sort_values(ascending=False)

Pacific     4
Woman       2
State       0
TotalPop    0
Hispanic    0
White       0
Black       0
Native      0
Asian       0
Income      0
Man         0
dtype: int64
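One plausible explanation for the new NaNs in "Woman" is that some raw strings carry no female count at all. The sketch below assumes such entries look like `2872643M_F` (a hypothetical value; the raw rows are not shown here) and uses `str.extract` with named groups as an alternative to the split-and-replace steps.

```python
import pandas as pd

# Second entry is a hypothetical record with an empty female count
gender = pd.Series(["510388M_543273F", "2872643M_F"])

# Named groups become column names; \d* also matches an empty count
parts = gender.str.extract(r"^(?P<Man>\d*)M_(?P<Woman>\d*)F$")

# An empty string cannot be parsed as a number, so it becomes NaN
counts = parts.apply(pd.to_numeric)
print(counts)
```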

We can easily fill these 2 NaNs in the "Woman" column by calculating them as the difference between "TotalPop" and "Man".

In [106]:
df = df.fillna(value = {"Woman": df.TotalPop - df.Man})
df.Woman = df.Woman.astype(int)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 58
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   State     51 non-null     object 
 1   TotalPop  51 non-null     int64  
 2   Hispanic  51 non-null     object 
 3   White     51 non-null     object 
 4   Black     51 non-null     object 
 5   Native    51 non-null     object 
 6   Asian     51 non-null     object 
 7   Pacific   47 non-null     object 
 8   Income    51 non-null     float64
 9   Man       51 non-null     int64  
 10  Woman     51 non-null     int64  
dtypes: float64(1), int64(3), object(7)
memory usage: 4.8+ KB
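The fill-by-difference step relies on the assumption that "TotalPop" equals "Man" plus "Woman", which holds for the complete rows shown earlier (for Rhode Island, 510388 + 543273 = 1053661). A minimal sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

# Toy frame: the second row is missing its female count
toy = pd.DataFrame(
    {"TotalPop": [1000, 500], "Man": [480, 240], "Woman": [520.0, np.nan]}
)

# fillna aligns on the index, so only the missing entries take the computed value
toy["Woman"] = toy["Woman"].fillna(toy["TotalPop"] - toy["Man"]).astype(int)

# Sanity check: the two groups now add up to the total in every row
consistent = bool((toy["Man"] + toy["Woman"] == toy["TotalPop"]).all())
print(toy)
print(consistent)
```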


Next, we convert the percentage notation into decimal fractions for the ethnicity share columns. After that, the data is usable for further analysis.

In [107]:
transform_columns = ["Hispanic", "White", "Black", "Native", "Asian", "Pacific"]
for column in transform_columns:
    df[column] = df[column].replace(r'[%,]', '', regex=True)
    df[column] = pd.to_numeric(df[column])
    df[column] = df[column]*.01
df.head(10)

Unnamed: 0,State,TotalPop,Hispanic,White,Black,Native,Asian,Pacific,Income,Man,Woman
0,Rhode Island,1053661,0.1336,0.7433,0.0568,0.0035,0.0325,0.0004,59125.27,510388,543273
1,South Carolina,4777576,0.0506,0.6289,0.2875,0.0029,0.0125,0.0005,46296.81,2322409,2455167
2,South Dakota,843190,0.0324,0.825,0.0142,0.0942,0.0102,0.0004,51805.41,423477,419713
3,Tennessee,6499615,0.0472,0.7349,0.1828,0.0023,0.0141,0.0004,47328.08,3167756,3331859
4,Texas,26538614,0.3805,0.4469,0.1165,0.0026,0.0367,0.0007,55874.52,13171316,13367298
5,Utah,2903379,0.1347,0.7941,0.0102,0.0108,0.022,0.0083,63488.92,1459229,1444150
7,Vermont,626604,0.0161,0.9398,0.0098,0.003,0.0124,0.0003,55602.97,308573,318031
8,Virginia,8256630,0.0801,0.6327,0.2018,0.0021,0.0546,0.0006,72866.01,4060948,4195682
9,Washington,6985464,0.1114,0.7204,0.0338,0.0141,0.0702,0.0061,64493.77,3487725,3497739
10,West Virginia,1851420,0.0129,0.9218,0.0366,0.0015,0.0068,0.0003,41437.11,913631,937789
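The percentage conversion can also be wrapped in a small helper. This is only a sketch: the name `percent_to_fraction` is hypothetical, and missing entries are deliberately left as NaN so that they can be imputed afterwards.

```python
import pandas as pd


def percent_to_fraction(series):
    """Convert strings such as '13.36%' to fractions such as 0.1336."""
    stripped = series.str.rstrip("%")
    # Missing or malformed entries stay NaN instead of raising an error
    return pd.to_numeric(stripped, errors="coerce") / 100


pct = pd.Series(["13.36%", "0.04%", None])
frac = percent_to_fraction(pct)
print(frac)
```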


It's time to deal with the NaNs in the "Pacific" column. There are several ways to fill NaNs, for example mean imputation, median imputation, or regression-based imputation. In this case we will use the K-Nearest Neighbours imputer, a machine-learning approach that fills each missing value with the mean of that feature over the k nearest neighbours of the incomplete row, with distances computed on the features that are present. Because the method is distance-based, the features should be on a comparable scale; the ethnicity share columns already lie between 0 and 1, so no extra scaling is needed. To sum up, we will impute the missing values based on these 6 columns.



In [108]:
imputer = KNNImputer(n_neighbors=5, weights = "uniform")
X = df[transform_columns]
X_transform = imputer.fit_transform(X)
df_temp = pd.DataFrame(X_transform, columns=transform_columns)
df.Pacific = df_temp.Pacific.values
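The imputation step can be tried on a toy frame to see what it does. The values below are made up, and `n_neighbors=2` is chosen only because the toy has four rows. Passing `index=toy.index` when rebuilding the DataFrame is one way to keep the non-contiguous labels left behind by `drop_duplicates`; the cell above gets the same effect by assigning `.values`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy shares with one missing "Pacific" value and gaps in the index labels
toy = pd.DataFrame(
    {
        "White": [0.74, 0.63, 0.82, 0.73],
        "Asian": [0.03, 0.01, 0.01, 0.01],
        "Pacific": [0.0004, 0.0005, np.nan, 0.0004],
    },
    index=[0, 1, 3, 7],
)

# The missing entry becomes the unweighted mean of its 2 nearest rows
imputer = KNNImputer(n_neighbors=2, weights="uniform")
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(filled)
```

Only the missing cell changes; observed values pass through the imputer untouched.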

Now we are ready to look at the descriptive statistics of the columns in our dataset. At this stage we can spot unusual observations, especially outliers.

In [109]:
df.describe()

Unnamed: 0,TotalPop,Hispanic,White,Black,Native,Asian,Pacific,Income,Man,Woman
count,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0
mean,6265067.0,0.12648,0.672612,0.119451,0.015478,0.036451,0.00293,55922.667255,3081423.0,3183644.0
std,7017552.0,0.156977,0.183943,0.118546,0.031053,0.052935,0.012277,11479.923759,3464446.0,3553646.0
min,626604.0,0.0129,0.0077,0.0009,0.0,0.0008,0.0,20720.54,306674.0,318031.0
25%,1860392.0,0.04675,0.56805,0.03055,0.00195,0.01245,0.0003,48358.54,921618.5,938774.0
50%,4397353.0,0.0846,0.7114,0.082,0.0036,0.0232,0.00044,54207.82,2164208.0,2233145.0
75%,6845525.0,0.13415,0.79635,0.1737,0.01035,0.03845,0.00099,63889.835,3393406.0,3476838.0
max,38421460.0,0.9889,0.9398,0.5178,0.1639,0.3659,0.0876,78765.4,19087140.0,19334330.0
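To go from eyeballing the summary table to an explicit list of candidates, one common heuristic is the 1.5 × IQR rule. The sketch below is not part of the original analysis: the helper `iqr_outliers`, the threshold `k=1.5`, and the toy income values are all assumptions made for illustration.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return a boolean mask of values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy incomes with one suspiciously low value
income = pd.Series([46000, 48000, 52000, 55000, 56000, 60000, 64000, 20000])
mask = iqr_outliers(income)
print(income[mask])
```

Flagged rows are candidates for inspection rather than automatic removal, because an extreme value can be perfectly legitimate.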


In [110]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 58
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   State     51 non-null     object 
 1   TotalPop  51 non-null     int64  
 2   Hispanic  51 non-null     float64
 3   White     51 non-null     float64
 4   Black     51 non-null     float64
 5   Native    51 non-null     float64
 6   Asian     51 non-null     float64
 7   Pacific   51 non-null     float64
 8   Income    51 non-null     float64
 9   Man       51 non-null     int64  
 10  Woman     51 non-null     int64  
dtypes: float64(7), int64(3), object(1)
memory usage: 4.8+ KB


Now the data is ready for further analysis.