In [None]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA 
from sklearn.linear_model import LinearRegression
import altair as alt

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split 
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
df = pd.read_csv('../data/old_train_data.zip')

In [3]:
df.drop(labels=['external_id','unacast_session_count'],axis=1).columns


Index(['month', 'year', 'monthly_number_of_sessions',
       'monthly_unique_sessions', 'monthly_repeated_sessions',
       'monthly_avg_length_of_session', 'monthly_avg_light_activity',
       'monthly_avg_moderate_activity', 'monthly_avg_vigorous_activity',
       'monthly_count_slide_single',
       ...
       'avg_wind_8_9', 'avg_wind_9_10', 'avg_wind_10_11', 'avg_wind_11_12',
       'avg_wind_12_above', 'perfect_days', 'hpi',
       'state_and_local_amount_per_capita', 'state_amount_per_capita',
       'local_amount_per_capita'],
      dtype='object', length=859)

What you will find in this notebook:
    
1. Data Overview

2. Simple correlation

3. Variance

4. PCA


### Executive summary:

There are no null values at all. There are, however, many zero values, with some columns having over 90% zeros; this is most likely not an error, just the correct data from the census.
The data is generally (but not exclusively) of two types: counts (usually sets of columns) and monetary data (income related), which means two very different scales (relevant to the PCA discussion).

There are two types of columns with a relatively high variance compared to the others: 
    - Columns with a high rate of zeros 
    - Columns with monetary data, such as income
While the second may be addressed with some standardization (a change of base unit, say from dollars to thousands of dollars), the first is more problematic: a linear rescaling does not move the zero values and may still skew the variance.

Running PCA, it looks like 99.9% of the variance in the census data can be explained by 27 eigenvectors. 

### Data Overview

Started off with a very rough overview of the census variables and the topics they cover (table below).
I recommend going through the variables once with the variable description table open.

# Rough overview of census data by topic

| Category (roughly)                                              	| No. of Columns 	| Sets of Variables 	| First Column Index 	|
|-----------------------------------------------------------------	|----------------	|-------------------	|--------------	|
| Sex by age                                                      	| 14             	| 1                 	| 134          	|
| Commute                                                         	| 12             	| 2                 	| 148          	|
| Under 18                                                        	| 9              	| 1                 	| 160          	|
| Child age by family type (single father/mother, married )       	| 20             	| 1                 	| 169          	|
| Relationship to child (adopted, biological, etc.)               	| 16             	| 2                 	| 189          	|
| Median Family Income                                            	| 3              	| 1                 	| 205          	|
| Household type (people in household)                            	| 34             	| 4                 	| 208          	|
| Marital Status                                                  	| 19             	| 1                 	| 242          	|
| Birth by women (age and marital status)                         	| 13             	| 2                 	| 261          	|
| Education                                                       	| 12             	| 2                 	| 274          	|
| Spoken language – children                                      	| 6              	| 1                 	| 286          	|
| Poverty income and government food-stamps (includes gini)       	| 72             	| 10                	| 292          	|
| Age of children (by family and employment)                      	| 27             	| 1                 	| 364          	|
| Work (hours, employment)                                        	| 9              	| 2                 	| 391          	|
| Housing (vacancies, tenure, rent, mortgage) – 3 income by house 	| 29             	| 7                 	| 400          	|
| Health and disability                                           	| 6              	| 3                 	| 429          	|
| Sex by age                                                      	| 16             	| 1                 	| 435          	|
| Education (current enrolment)                                   	| 17             	| 2                 	| 451          	|
| commute (time)                                                  	| 4              	| 2                 	| 468          	|
| Household by size                                               	| 2              	| 1                 	| 472          	|
| Health Insurance                                                	| 4              	| 1                 	| 478          	|
| Fertility (mother and baby weight)                              	| 18             	| 3                 	| 482          	|

In [4]:
# creating the DF of census columns + primary key
df_census = pd.concat([df.iloc[:,0:3],df.iloc[:,132:498],df.loc[:,'unacast_session_count']],axis=1, join='outer',sort=False)
df_census.shape

(50100, 370)

In [5]:
df_census.columns

Index(['external_id', 'month', 'year', 'B20004e10', 'B11016e1', 'B12001e12',
       'B20004e11', 'B19125e1', 'B12001e13', 'B23008e22',
       ...
       'fertility_rate_2010', 'fertility_rate_2011', 'fertility_rate_2012',
       'fertility_rate_2013', 'fertility_rate_2014', 'fertility_rate_2015',
       'fertility_rate_2016', 'fertility_rate_2017', 'fertility_rate_2018',
       'unacast_session_count'],
      dtype='object', length=370)

### Simple correlation

Calculated a simple correlation between the session count and each census column (probably of limited meaning, since columns typically come in "sets") and plotted a histogram; the correlations generally fall between -0.1 and 0.1.
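
A minimal sketch of the per-column correlation on synthetic data. The target name `unacast_session_count` comes from the notebook; the feature names `a` and `b` and the data are made up for illustration. `DataFrame.corrwith` returns one value per column without building the full correlation matrix, and dropping the target first keeps its trivial self-correlation of 1 out of the histogram.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for df_census (hypothetical columns "a" and "b")
toy = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
toy["unacast_session_count"] = 2 * toy["a"] + rng.normal(size=200)

target = toy["unacast_session_count"]
# Pearson correlation of every feature with the target, target itself excluded
target_corr = toy.drop(columns="unacast_session_count").corrwith(target)
print(target_corr.sort_values())
```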

In [6]:
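# correlate numeric columns only (recent pandas versions raise on non-numeric columns)
corr = df_census.select_dtypes(include=[np.number]).corr()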

In [7]:
alt.Chart(corr).mark_bar().encode(
    alt.X("unacast_session_count:Q", bin=alt.Bin(maxbins=30),title="Correlation (binned)"),
    y=alt.Y('count()', title="Count")).properties(
    title='Histogram - Correlation with Session Count'
)

### Variance

To look at the variance in each column faithfully and to map outliers, I first needed the rate of null and zero values.
None of the columns have nulls, but some have high rates of zeros. Unfortunately this is probably not an error, and how to deal with it needs some thought.

In addition I looked at the coefficient of variation (CoV: standard deviation over the mean) to get a standardized measure of spread, since all monetary columns have relatively high variance simply because of their unit. There is (unsurprisingly) a correlation between having many zero values (low mean) and a higher CoV, so a high CoV is not necessarily indicative of anything. I still found it a little interesting, as fewer than 30 columns have a CoV above 2, and it replaces the variance plot because of the scale discrepancies.
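
A small sketch of why zero-heavy columns end up with a high CoV, on synthetic data (the column names `dense` and `sparse` and the Poisson counts are assumptions for illustration, not census data). It also shows that a change of unit leaves the CoV unchanged.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
counts = rng.poisson(lam=50, size=n).astype(float)   # dense count column
sparse = counts * (rng.random(n) < 0.1)               # same counts, about 90% zeroed out
toy = pd.DataFrame({"dense": counts, "sparse": sparse})

summary = pd.DataFrame({
    "CoV": toy.std() / toy.mean(),   # coefficient of variation
    "zero": (toy == 0).mean(),       # share of zero values
})
# A change of unit (dividing by 1000) does not change the CoV
rescaled = toy / 1000
summary["CoV_rescaled"] = rescaled.std() / rescaled.mean()
print(summary)
```

The sparse column keeps the same non-zero values but has a much lower mean, so its std/mean ratio is far larger than that of the dense column.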



In [8]:
null_val = df_census.isnull().sum(axis = 0)/df_census.shape[0]
zero_val = (df_census == 0).sum(axis = 0)/df_census.shape[0]

In [9]:
df_desc = df_census.describe().T#(verbose=True)


In [10]:
# coefficient of variation (std / mean)
df_desc["CoV"] = (df_desc["std"])/df_desc["mean"]
df_desc['zero'] = zero_val
df_desc['null'] = null_val
df_desc.sort_values(by=['zero'], axis=0,ascending=False)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max,CoV,zero,null
B10010e3,50100.0,1752.109780,10412.469215,0.00,0.00,0.00,0.00,225417.00,5.942818,0.960479,0.000000
B13016e9,50100.0,1.124950,5.047604,0.00,0.00,0.00,0.00,78.00,4.486958,0.923353,0.000000
B11005e10,50100.0,1.556886,6.012560,0.00,0.00,0.00,0.00,119.00,3.861914,0.890220,0.000000
B17012e7,50100.0,2.473453,9.288690,0.00,0.00,0.00,0.00,110.00,3.755353,0.882236,0.000000
B23008e11,50100.0,3.190818,12.865915,0.00,0.00,0.00,0.00,195.00,4.032168,0.881836,0.000000
...,...,...,...,...,...,...,...,...,...,...,...
fertility_rate_2003,50100.0,68.202802,10.231732,32.10,62.80,67.07,71.84,120.61,0.150019,0.000000,0.000000
avg_age_of_mother,50100.0,28.299261,1.238500,25.69,27.45,28.04,28.99,32.54,0.043764,0.000000,0.000000
avg_birth_weight,50100.0,3274.838319,55.949518,3028.16,3236.53,3273.41,3317.62,3438.81,0.017085,0.000000,0.000000
year,50100.0,2018.450000,0.497499,2018.00,2018.00,2018.00,2019.00,2019.00,0.000246,0.000000,0.000000


In [11]:
alt.Chart(df_desc).mark_bar().encode(
    alt.X("CoV:Q", bin=alt.Bin(maxbins=30)),
    y=alt.Y('count()', title="Count")).properties(
    title='Histogram - Coefficient of Variation (std/mean)'
)


In [12]:
# Number of columns with a CoV higher than 2, one of which is the session count
sum(df_desc.CoV>2)#-df_desc.shape[0]

26

In [13]:
alt.Chart(df_desc).mark_bar().encode(
    alt.X("zero:Q", bin=alt.Bin(step=0.1), title="Share of Zero Values"),
    y=alt.Y('count()', title="Number of Columns")
).properties(
    title="Histogram - Share of Zeros per Column"
)

In [14]:
# Number of columns with more than 50% of values being zero
sum(df_desc.zero>0.5)#-df_desc.shape[0]

30

**There are no missing (null) values**

In [15]:
# filter first, then sort, so the boolean mask lines up with the frame it indexes
df_desc[df_desc['std']>1000].sort_values(by=['CoV'], axis=0,ascending=False)



Unnamed: 0,count,mean,std,min,25%,50%,75%,max,CoV,zero,null
B10010e3,50100.0,1752.109780,10412.469215,0.0,0.0,0.0,0.0,225417.0,5.942818,0.960479,0.0
B10010e2,50100.0,12376.360080,30800.252527,0.0,0.0,0.0,0.0,250001.0,2.488636,0.818762,0.0
B10010e1,50100.0,34349.954092,48882.229683,0.0,0.0,0.0,62946.0,250001.0,1.423065,0.574850,0.0
B20004e14,50100.0,7730.606786,10880.784963,0.0,0.0,0.0,15486.0,82604.0,1.407494,0.583234,0.0
B20004e8,50100.0,17126.557685,17975.802437,0.0,0.0,18229.0,29130.0,250001.0,1.049586,0.419162,0.0
...,...,...,...,...,...,...,...,...,...,...,...
B20004e1,50100.0,40712.273852,15390.165006,0.0,30564.0,37500.0,47805.0,155574.0,0.378023,0.001198,0.0
B20004e16,50100.0,29098.591617,10880.180736,0.0,23688.0,29362.0,35156.0,120417.0,0.373907,0.048303,0.0
B20004e3,50100.0,29666.299401,10575.899384,0.0,24357.0,30109.0,35308.0,103973.0,0.356495,0.038723,0.0
B20004e13,50100.0,33049.859082,10908.732319,0.0,25563.0,31410.0,39694.0,80507.0,0.330069,0.003194,0.0


Income columns have high variability; consider lowering the scale (say from dollars to hundreds of dollars, to lower the std number).
B19125e2 - 2016 census: Median Family Income by Presence of Own Children: With own children of the householder under 18 years
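
A minimal sketch of how the unit of a monetary column affects PCA, using two made-up independent columns (`income` in dollars and `count` as a head count; neither is taken from the census data). On raw values the first component is essentially the income column, while standardizing with `StandardScaler` puts both columns on the same footing.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
income = rng.normal(60_000, 15_000, size=1_000)   # dollars (hypothetical)
count = rng.normal(100, 30, size=1_000)           # head count (hypothetical)
X = np.column_stack([income, count])

raw_ratio = PCA().fit(X).explained_variance_ratio_
scaled_ratio = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_
print(raw_ratio)     # first component is almost entirely the income column
print(scaled_ratio)  # roughly balanced for two independent columns
```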

 <font size="3"> Relationship between the rate of zero values and std - potentially messes up PCA a little.</font> 

In [16]:
df_desc[df_desc['zero']>0.6]

Unnamed: 0,count,mean,std,min,25%,50%,75%,max,CoV,zero,null
B23008e24,50100.0,6.291816,18.572047,0.0,0.0,0.0,0.0,247.0,2.951778,0.769661,0.0
B23008e20,50100.0,9.114571,23.94557,0.0,0.0,0.0,7.0,329.0,2.627175,0.716567,0.0
B10010e2,50100.0,12376.36008,30800.252527,0.0,0.0,0.0,0.0,250001.0,2.488636,0.818762,0.0
B17012e7,50100.0,2.473453,9.28869,0.0,0.0,0.0,0.0,110.0,3.755353,0.882236,0.0
B11005e10,50100.0,1.556886,6.01256,0.0,0.0,0.0,0.0,119.0,3.861914,0.89022,0.0
B10010e3,50100.0,1752.10978,10412.469215,0.0,0.0,0.0,0.0,225417.0,5.942818,0.960479,0.0
B16007e7,50100.0,10.27505,50.74849,0.0,0.0,0.0,0.0,1293.0,4.939002,0.79481,0.0
B23008e11,50100.0,3.190818,12.865915,0.0,0.0,0.0,0.0,195.0,4.032168,0.881836,0.0
B09002e11,50100.0,8.358483,19.304604,0.0,0.0,0.0,9.0,243.0,2.309582,0.691018,0.0
B09002e10,50100.0,12.458283,23.56107,0.0,0.0,0.0,16.0,222.0,1.891197,0.607186,0.0


In [17]:
df_desc[df_desc['CoV']>2]

Unnamed: 0,count,mean,std,min,25%,50%,75%,max,CoV,zero,null
B23008e24,50100.0,6.291816,18.572047,0.0,0.0,0.0,0.0,247.0,2.951778,0.769661,0.0
B23008e20,50100.0,9.114571,23.94557,0.0,0.0,0.0,7.0,329.0,2.627175,0.716567,0.0
B08301e10,50100.0,82.473054,196.234411,0.0,0.0,23.0,86.0,4206.0,2.379376,0.267465,0.0
B17012e6,50100.0,12.713772,26.793419,0.0,0.0,0.0,14.0,336.0,2.107433,0.585629,0.0
B16007e5,50100.0,26.835928,58.741356,0.0,0.0,0.0,28.0,637.0,2.188907,0.502196,0.0
B10010e2,50100.0,12376.36008,30800.252527,0.0,0.0,0.0,0.0,250001.0,2.488636,0.818762,0.0
B17012e7,50100.0,2.473453,9.28869,0.0,0.0,0.0,0.0,110.0,3.755353,0.882236,0.0
B11005e10,50100.0,1.556886,6.01256,0.0,0.0,0.0,0.0,119.0,3.861914,0.89022,0.0
B10010e3,50100.0,1752.10978,10412.469215,0.0,0.0,0.0,0.0,225417.0,5.942818,0.960479,0.0
B16007e7,50100.0,10.27505,50.74849,0.0,0.0,0.0,0.0,1293.0,4.939002,0.79481,0.0


In [18]:
# Share of the columns with CoV > 2.8 that also have more than 73.9% zero values
high_cov = df_desc.index[df_desc['CoV'] > 2.8]
high_zero = df_desc.index[df_desc['zero'] > 0.739]
len(high_cov.intersection(high_zero)) / len(high_cov)

1.0

### PCA
Using PCA on the numeric columns in the census data (all 366 columns), 99.9% of the variance can be captured by 27 eigenvectors (7.4% of the number of columns). From the 29th to the 30th eigenvector there is also a drop of an order of magnitude in the variance explained, which falls below a 100th of a percent (0.00001). These counts appear to be for the unscaled data; with standardized columns, 99% of the variance needs 171 components (computed further down).

When running PCA a second time after removing columns with a high rate of zero values (60% or more), the 99.9% was reached with 25 vectors rather than 27. This minor difference is probably not worth the trouble of removing those columns (21 in total).

It is still worth considering if we want to maximize accuracy and reduce training time.
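
The component counts quoted above can be read off the cumulative explained variance. A minimal sketch on synthetic data (the array `X` and the name `pca_toy` are made up here and stand in for the notebook's data and fitted model):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data with rapidly decaying column scales
X = rng.normal(size=(500, 20)) * np.logspace(0, -3, 20)

pca_toy = PCA().fit(X)
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)
# Smallest number of components whose cumulative ratio reaches the threshold
n_999 = int(np.searchsorted(cumulative, 0.999) + 1)
print(n_999, cumulative[n_999 - 1])
```

`np.searchsorted` returns the first index at which the cumulative ratio is at least 0.999, so adding 1 turns that index into a component count.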

In [19]:
# df of numeric columns for the PCA
df_pca = df_census.select_dtypes(include=[np.number]).iloc[:,2:len(df_census.columns)-2]

In [20]:
df_pca.columns

Index(['B20004e10', 'B11016e1', 'B12001e12', 'B20004e11', 'B19125e1',
       'B12001e13', 'B23008e22', 'B11005e12', 'B19101e10', 'B23008e25',
       ...
       'fertility_rate_2009', 'fertility_rate_2010', 'fertility_rate_2011',
       'fertility_rate_2012', 'fertility_rate_2013', 'fertility_rate_2014',
       'fertility_rate_2015', 'fertility_rate_2016', 'fertility_rate_2017',
       'fertility_rate_2018'],
      dtype='object', length=366)

In [21]:
df_pca.std()

B20004e10              17596.127172
B11016e1                1059.909605
B12001e12                423.026309
B20004e11              31290.375944
B19125e1               35294.538401
                           ...     
fertility_rate_2014        7.652088
fertility_rate_2015        7.636146
fertility_rate_2016        7.189558
fertility_rate_2017        6.831797
fertility_rate_2018        6.837637
Length: 366, dtype: float64

## PCA with Scaled Data

In [22]:
scaler = StandardScaler()
df_pca_scaled = scaler.fit_transform(df_pca)

In [23]:
df_pca_scaled.std(axis = 0)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.

In [24]:
# Fitting sk-learn PCA
pca = PCA()
pca.fit(df_pca)

PCA()

In [25]:
# Fitting sk-learn PCA on scaled data
pca = PCA()
pca.fit(df_pca_scaled)

PCA()

In [26]:
pca.components_.shape

(366, 366)

In [27]:
# rows of pca.components_ are components (not samples), so only the first 366 rows carry loadings
pd.concat([pd.DataFrame(pca.components_),df.loc[:,'unacast_session_count']],axis=1, join='outer',sort=False)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,357,358,359,360,361,362,363,364,365,unacast_session_count
0,0.015633,0.079957,0.059637,0.013508,0.009862,0.079911,0.044614,0.067778,0.041101,0.048227,...,0.010000,0.008220,0.007448,0.008109,0.007775,0.007959,0.007081,0.004676,0.003183,90.0
1,0.068513,0.018534,-0.036938,0.092990,0.122628,0.043720,-0.029487,0.045358,-0.036229,-0.084698,...,-0.048464,-0.049029,-0.049405,-0.047522,-0.046511,-0.044619,-0.041985,-0.039823,-0.039666,27.0
2,0.032218,-0.060217,-0.046761,0.023795,0.034100,0.004796,-0.006544,-0.059868,-0.031492,0.001589,...,0.162235,0.158052,0.156166,0.155790,0.153577,0.150384,0.148607,0.139545,0.132533,27.0
3,-0.009406,0.042110,-0.049309,-0.010134,-0.027561,0.027087,0.004251,0.094684,0.061526,-0.068876,...,0.151945,0.157601,0.159749,0.162717,0.164777,0.168335,0.165759,0.168503,0.167313,24.0
4,-0.037794,0.034363,0.078194,0.002146,0.010873,-0.025932,-0.038917,-0.048157,-0.037095,-0.022906,...,0.085769,0.086241,0.088801,0.084971,0.082381,0.080288,0.075820,0.069572,0.061739,12.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50095,,,,,,,,,,,...,,,,,,,,,,55.0
50096,,,,,,,,,,,...,,,,,,,,,,75.0
50097,,,,,,,,,,,...,,,,,,,,,,83.0
50098,,,,,,,,,,,...,,,,,,,,,,


In [28]:
# Number of components needed for 99% of the variance with scaled data = 171
for i in range(366):
    if sum(pca.explained_variance_ratio_[:i]) >= 0.99:
        print("Number of components for 99 percent variance is :", i)
        break

Number of components for 99 percent variance is : 171


In [29]:
pca.explained_variance_ratio_.shape

(366,)

In [30]:
#list of numbers between 1 and 100
no_of_vectors = [i for i in range(1,101)]
# list of sum of var explained by n first vectors
explained_variance = [round(sum(pca.explained_variance_ratio_[:i]),4) for i in range(1,101)]
#sum(pca.explained_variance_ratio_[:27])

In [31]:
d = {'num_of_vectors': no_of_vectors, 'total_explained_variance': explained_variance, 'marginal_explained_variance':list(pca.explained_variance_ratio_[:100])}
pca_output = pd.DataFrame(data=d).set_index('num_of_vectors')

In [32]:
pca_output[:30]

Unnamed: 0_level_0,total_explained_variance,marginal_explained_variance
num_of_vectors,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.3729,0.372867
2,0.4983,0.125429
3,0.5496,0.051277
4,0.58,0.030451
5,0.6047,0.024667
6,0.625,0.020309
7,0.6421,0.01715
8,0.6582,0.016078
9,0.6717,0.013473
10,0.6827,0.010962


In [33]:
df_desc[df_desc['zero']>0.6].index[:]

Index(['B23008e24', 'B23008e20', 'B10010e2', 'B17012e7', 'B11005e10',
       'B10010e3', 'B16007e7', 'B23008e11', 'B09002e11', 'B09002e10',
       'B09002e12', 'B23025e6', 'B11005e9', 'B23008e7', 'B23008e6',
       'B09001e10', 'B11005e8', 'B13016e3', 'B13016e8', 'B13016e9',
       'four_or_more_in_nonfamily_household'],
      dtype='object')

In [34]:
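# project the scaled data onto the first 27 components (rows of components_ are components, so use transform)
temp = pd.concat([pd.DataFrame(pca.transform(df_pca_scaled)[:, :27], index=df_pca.index),df.loc[:,'unacast_session_count']],axis=1, join='outer',sort=False).rename(str, axis='columns').corr() #["unacast_session_count"]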

In [35]:
mlr = LinearRegression()

In [36]:
# linear correlation between sessions and PCA-transformed data
alt.Chart(temp).mark_bar().encode(
    alt.X("unacast_session_count:Q", bin=alt.Bin(maxbins=30),title="Correlation (binned)"),
    y=alt.Y('count()', title="Count")).properties(
    title='Histogram - Correlation with Session Count'
)

In [37]:
# dropping columns with at least 60% zero rate
df_pca_2 = df_pca.drop((df_desc[df_desc['zero']>0.6].index), axis=1)

In [38]:
pca2 = PCA(n_components=100)
pca2.fit(df_pca_2)

PCA(n_components=100)

In [39]:
#list of numbers between 1 and 100
no_of_vectors = [i for i in range(1,101)]
# list of sum of var explained by n first vectors
explained_variance = [round(sum(pca2.explained_variance_ratio_[:i]),4) for i in range(1,101)]
#sum(pca.explained_variance_ratio_[:27])
d = {'num_of_vectors': no_of_vectors, 'total_explained_variance': explained_variance, 'marginal_explained_variance':list(pca2.explained_variance_ratio_)}
pca2_output = pd.DataFrame(data=d).set_index('num_of_vectors')

In [40]:
pca2_output[:30]

Unnamed: 0_level_0,total_explained_variance,marginal_explained_variance
num_of_vectors,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.5824,0.582356
2,0.7244,0.142065
3,0.7679,0.043519
4,0.7992,0.031227
5,0.8291,0.029942
6,0.8517,0.022556
7,0.8711,0.019447
8,0.8888,0.01771
9,0.9059,0.017107
10,0.9211,0.01518


In [41]:
X = df_census.drop(columns = ['external_id', 'unacast_session_count'])
y = df_census['unacast_session_count']

X = X.fillna(0)
y = y.fillna(0)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Z = pca.transform(df_pca_scaled)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.2, random_state = 2020)
lr = LinearRegression()
lr.fit(X_train, y_train)

print("Test accuracy is :")
y_preds = lr.predict(X_test)                 
rmse = np.sqrt(mean_squared_error(y_test, y_preds))  # RMSE without the deprecated squared=False argument
r2 = r2_score(y_test, y_preds)
print("Root mean squared error: %0.3f and r^2 score: %0.3f" % (rmse,r2))

Test accuracy is :
Root mean squared error: 202.614 and r^2 score: 0.403
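
In the cell above the scaler is fit on all rows before the split. A minimal sketch of the same kind of evaluation with the scaler inside a pipeline, so that it is fit on the training rows only. The data is synthetic and the names `X_toy`, `y_toy` and `model` are assumptions for illustration. For plain least squares this should not change the predictions, but it becomes relevant once PCA or regularization is added to the pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2020)
# Five synthetic features on very different scales
X_toy = rng.normal(size=(1_000, 5)) * np.array([1, 10, 100, 1_000, 10_000])
y_toy = X_toy @ np.array([3.0, 0.2, 0.01, 0.0, 0.0]) + rng.normal(size=1_000)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=2020)
# The scaler is fit inside the pipeline, i.e. on the training rows only
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

preds = model.predict(X_te)
rmse_toy = np.sqrt(mean_squared_error(y_te, preds))
r2_toy = r2_score(y_te, preds)
print("RMSE: %0.3f, r^2: %0.3f" % (rmse_toy, r2_toy))
```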


## Look at the columns to find some groups for PCA

### 'Women Who Had a Birth by Age' columns

In [42]:
df_census[['B13016e3', 'B13016e4', 'B13016e5', 'B13016e6', 'B13016e7', 'B13016e8', 'B13016e9']].sum(axis=1)

0         21
1        105
2        105
3        105
4        105
        ... 
50095     17
50096     17
50097     17
50098    347
50099    347
Length: 50100, dtype: int64

In [43]:
df_census['B13016e2']

0         21
1        105
2        105
3        105
4        105
        ... 
50095     17
50096     17
50097     17
50098    347
50099    347
Name: B13016e2, Length: 50100, dtype: int64

The sum of the columns 'B13016e3', 'B13016e4', 'B13016e5', 'B13016e6', 'B13016e7', 'B13016e8' and 'B13016e9' is equal to the column 'B13016e2'. 
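
Rather than comparing the printed rows by eye, the equality can be checked over every row. A minimal sketch on a toy frame (the column names `total`, `part_a` and `part_b` are hypothetical stand-ins for the total column and its component columns):

```python
import pandas as pd

# Toy frame mimicking a total column and its parts (hypothetical names)
toy = pd.DataFrame({
    "total": [21, 105, 17],
    "part_a": [10, 100, 7],
    "part_b": [11, 5, 10],
})
parts_sum = toy[["part_a", "part_b"]].sum(axis=1)
# True only if the parts add up to the total in every row
all_match = bool((parts_sum == toy["total"]).all())
print(all_match)
```

Replacing `.all()` with `.mean()` gives the share of matching rows, which helps when the columns agree only approximately.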

### 'Poverty Status by Age: Income in the past 12 months below poverty level' columns

In [44]:
df_census['B17020e2']

0        308
1        193
2        193
3        193
4        193
        ... 
50095    120
50096    120
50097    120
50098    714
50099    714
Name: B17020e2, Length: 50100, dtype: int64

In [45]:
df_census[['B17020e3', 'B17020e4', 'B17020e5', 'B17020e6']].sum(axis=1)

0        300
1        169
2        169
3        169
4        169
        ... 
50095     95
50096     95
50097     95
50098    617
50099    617
Length: 50100, dtype: int64

### '2016 census: Median Family Income: Total' and '2016 census: Median Family Income by Presence of Own Children: Total'

In [46]:
df_census['B19113e1']

0         83259
1        100463
2        100463
3        100463
4        100463
          ...  
50095     84271
50096     84271
50097     84271
50098     72019
50099     72019
Name: B19113e1, Length: 50100, dtype: int64

In [47]:
(df_census['B19125e1'] == df_census['B19113e1']).sum()

50100

In [48]:
df_census['B19125e1']

0         83259
1        100463
2        100463
3        100463
4        100463
          ...  
50095     84271
50096     84271
50097     84271
50098     72019
50099     72019
Name: B19125e1, Length: 50100, dtype: int64

The two columns are identical in all 50,100 rows, so one of them is redundant.
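
Exact duplicates like this pair can also be searched for across the whole frame. A minimal sketch on a toy frame, reusing the two column names and a few values shown above; the third column `other` is made up:

```python
import pandas as pd

toy = pd.DataFrame({
    "B19113e1": [83259, 100463, 84271],
    "B19125e1": [83259, 100463, 84271],   # exact copy of the column above
    "other": [1, 2, 3],
})
# Transpose so that columns become rows, then flag repeats of an earlier row
dup_mask = toy.T.duplicated().to_numpy()
deduped = toy.loc[:, ~dup_mask]
print(list(toy.columns[dup_mask]))
```

Transposing copies the data, so on much wider or longer frames a pairwise check on candidate columns may be the cheaper option.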

### '2016 census: Population Under 18 Years by Age: In households' and '2016 census: Sex by Age' 

In [49]:
df_census[['B01001e30', 'B01001e6']].sum(axis=1)

0        294
1        293
2        293
3        293
4        293
        ... 
50095     44
50096     44
50097     44
50098    646
50099    646
Length: 50100, dtype: int64

In [50]:
df_census['B09001e9']

0        288
1        293
2        293
3        293
4        293
        ... 
50095     44
50096     44
50097     44
50098    646
50099    646
Name: B09001e9, Length: 50100, dtype: int64

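The sum of 'B01001e30' and 'B01001e6' closely matches 'B09001e9': most of the displayed rows are equal, though row 0 differs (294 vs. 288). The columns therefore carry largely the same information.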