# DAAN 822 Data Collection & Cleaning (2025)
## Group 5: Afolabi Isiaka, Matthew Kucas
## The New York City Police Department Stop-and-Frisk Program: Recent Trends

### Note: This data report is modeled after the data report code provided in DAAN 822, Lesson 9

Import libraries

In [192]:
import pandas as pd 

# SECTION 1: Stop and Frisk data

Import data into pandas dataframe

In [193]:
#Load Stop and Frisk data file
saf_data = pd.read_csv('C:\\Users\\mekna\\OneDrive\\Documents\\PSU\\DAAN_822\\Project\\Data_Warehouse\\NYPD_Stop_and_Frisk.csv', low_memory=False)

First check if any rows in the saf_data dataset are not unique (look for duplicates)

In [194]:
# Code ref: Gemini AI
# Check for duplicate rows
unique_rows = not saf_data.duplicated().any()

if unique_rows:
    print("No duplicates present")
else:
    print("Duplicates present")

No duplicates present


Create dataframe restricted to data that will be necessary for analysis (based on previous analyses of data)

In [195]:
# Identify column types
numerical_cols = ['year', 'pct', 'datestop', 'timestop', 'perobs', 'ht_feet', 'ht_inch', 'city', 'weight', 'xcoord', 'ycoord', 'age', 'detailcm', 'detailCM']
binary_cols = ['arstmade', 'sumissue', 'frisked', 'searched', 'pistol', 'riflshot', 'asltweap', 'knifcuti', 'machgun', 'othrweap']
categorical_cols = ['crimsusp', 'sex', 'race', 'inout', 'trhsloc']  # Start with key categoricals

In [196]:
# Create analysis dataset from the numerical, binary, and categorical columns
saf_analysis_data = saf_data[numerical_cols + binary_cols + categorical_cols].copy()

Create and print dataframe to verify names of columns in saf_analysis_data

In [197]:
columns = pd.DataFrame(list(saf_analysis_data.columns.values))

columns

Unnamed: 0,0
0,year
1,pct
2,datestop
3,timestop
4,perobs
5,ht_feet
6,ht_inch
7,city
8,weight
9,xcoord


Create and print dataframe to show data type of each column in saf_analysis_data

In [198]:
data_types = pd.DataFrame(saf_analysis_data.dtypes,columns=['Data Type']) 

data_types 

Unnamed: 0,Data Type
year,int64
pct,int64
datestop,int64
timestop,int64
perobs,float64
ht_feet,int64
ht_inch,int64
city,object
weight,int64
xcoord,object


Get count of missing values in columns of saf_analysis_data dataframe

In [199]:
pd.set_option('display.max_rows', None)

#Find null values
missing_data_counts = pd.DataFrame(saf_analysis_data.isnull().sum(),columns=['Missing Values']) 
missing_data_counts

Unnamed: 0,Missing Values
year,0
pct,0
datestop,0
timestop,0
perobs,0
ht_feet,0
ht_inch,0
city,0
weight,0
xcoord,0


In [200]:
#Find blank values
blank_data_counts = pd.DataFrame((saf_analysis_data == ' ').sum(),columns=['Missing Values'])
blank_data_counts

Unnamed: 0,Missing Values
year,0
pct,0
datestop,0
timestop,0
perobs,0
ht_feet,0
ht_inch,0
city,5
weight,0
xcoord,23340


In [201]:
#Find ** values - identified as present in Age column in previous analysis of data
placeholder_data_counts = pd.DataFrame((saf_analysis_data == '**').sum(),columns=['Missing Values'])
placeholder_data_counts

Unnamed: 0,Missing Values
year,0
pct,0
datestop,0
timestop,0
perobs,0
ht_feet,0
ht_inch,0
city,0
weight,0
xcoord,0


In [202]:
saf_total_missing_values = missing_data_counts + blank_data_counts + placeholder_data_counts 
saf_total_missing_values

Unnamed: 0,Missing Values
year,0
pct,0
datestop,0
timestop,0
perobs,0
ht_feet,0
ht_inch,0
city,5
weight,0
xcoord,23340
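The three counts above (nulls, single-space blanks, and `**` placeholders) can also be gathered in one pass. The following is a minimal sketch on a small made-up dataframe; `sample` and `MISSING_MARKERS` are hypothetical names, and the marker list only covers the two markers identified in this report.

```python
import pandas as pd

# Hypothetical stand-in for saf_analysis_data (values are illustrative only)
sample = pd.DataFrame({
    "city": ["BROOKLYN", " ", "QUEENS", None],
    "age": ["25", "**", " ", "31"],
})

# Markers treated as missing in this report: a single space and '**'
MISSING_MARKERS = [" ", "**"]

# A cell counts as missing if it is null OR equals one of the markers
is_missing = sample.isna() | sample.isin(MISSING_MARKERS)
missing_counts = is_missing.sum()

print(missing_counts)
```

Combining the masks with `|` before summing avoids adding three separate dataframes and cannot double count a cell.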


Create and print dataframe with the count of present values in each column of saf_analysis_data

In [203]:
present_data_counts = pd.DataFrame(saf_analysis_data.count(), columns=['Present Values']) 

present_data_counts 

Unnamed: 0,Present Values
year,793112
pct,793112
datestop,793112
timestop,793112
perobs,793112
ht_feet,793112
ht_inch,793112
city,793112
weight,793112
xcoord,793112


Create and print a dataframe with the count of unique values in each column of saf_analysis_data

In [204]:
unique_value_counts = pd.DataFrame(columns=['Unique Values']) 

for v in list(saf_analysis_data.columns.values): 
    unique_value_counts.loc[v] = [saf_analysis_data[v].nunique()] 

unique_value_counts 

Unnamed: 0,Unique Values
year,4
pct,77
datestop,1461
timestop,1440
perobs,159
ht_feet,5
ht_inch,12
city,7
weight,393
xcoord,92276


Print the unique values for each column in saf_analysis_data to find other missing / incorrect data

In [205]:
pd.set_option('display.max_columns', None)

# Check columns for unique values
for col in (numerical_cols + binary_cols + categorical_cols):
    print(f"\nUnique values in {col}: {saf_analysis_data[col].unique()}")


Unique values in year: [2012 2013 2014 2015]

Unique values in pct: [ 40  23  81  66  32  43  75  67  44  25  73  48  60 110  14  79   9  41
  28  63  83  34  71  42  20  47   1  33   6  17  77  10  94  70  26  72
  24  13  46 103  62  30 113   5  19 108  18 114  52  22 122  88  61 100
   7 109  49  84 105 112 104 101  76 107  90  45 102 115 111 120  69  78
  68 123 106  50 121]

Unique values in datestop: [ 1012012  1022012  1042012 ... 12292015 12302015 12312015]

Unique values in timestop: [ 115  310 2000 ...  717  737  704]

Unique values in perobs: [  2.   1.   3.   5.  10.  30.   6.   8.   4.  35.  15.  12.   7.  20.
   0.  14.  60.   9.  50.  45.  55.  18.  11.  25. 120. 115.  99.  13.
  27.  21.  17.  23.  40.  19.  43. 100.  26.  16.  28.  90. 420. 925.
  81. 955.  79.  85. 103.  41.  22.  24. 180.  36. 200.  58. 144. 515.
 183.  52.  39.  73.  83. 225. 303. 245.  32.  74. 109.  33.  75.  54.
  80. 250. 140. 240.  53.  38. 152. 650.  51.  44. 754.  47. 523. 456.
 320. 315. 10

Get number of age values that are > 100

In [206]:
# Cast age to integer for statistical analysis (first need to clean a bit by removing space and ** values from the Age field)
saf_analysis_data['age'] = saf_analysis_data['age'].replace([' '], 0)
saf_analysis_data['age'] = saf_analysis_data['age'].replace(['**'], 0)
saf_analysis_data['age'] = saf_analysis_data['age'].astype(int)

#Count the number of obviously errant age values (all above 100 years old)
count = (saf_analysis_data['age'] > 100).sum()
count

1407
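Replacing blank and `**` ages with 0 keeps those rows in the summary statistics and pulls the mean age down. As an alternative sketch (not the approach used in this notebook), `pd.to_numeric` with `errors="coerce"` turns non-numeric markers into NaN, which pandas skips when computing statistics. The `sample` dataframe is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the age column, including both missing markers
sample = pd.DataFrame({"age": ["25", " ", "**", "999", "31"]})

# Non-numeric entries (' ' and '**') become NaN instead of 0
age = pd.to_numeric(sample["age"], errors="coerce")

n_missing = int(age.isna().sum())    # entries that could not be parsed
n_over_100 = int((age > 100).sum())  # implausible ages; NaN compares as False

print(n_missing, n_over_100)
```

The resulting column is float rather than int, because NaN cannot be stored in a plain integer column.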

Merge dataframes by index

In [207]:
data_quality_report = data_types.join(present_data_counts).join(saf_total_missing_values).join(unique_value_counts)

Print report on saf_analysis_data

In [208]:
print("\nData Quality Report") 

print("Total records: {}".format(len(saf_analysis_data.index)))


Data Quality Report
Total records: 793112
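The merged report joins four single-column dataframes on their shared index of column names. A compact way to build an equivalent table is sketched below on a small made-up dataframe; `sample` and `report` are hypothetical names, and the missing-value rule assumes the same two markers (`' '` and `'**'`) used earlier.

```python
import pandas as pd

# Hypothetical stand-in for saf_analysis_data
sample = pd.DataFrame({
    "year": [2012, 2013, 2013],
    "city": ["BROOKLYN", " ", "QUEENS"],
})

# Each entry is a Series indexed by column name, so pandas aligns them by index
report = pd.DataFrame({
    "Data Type": sample.dtypes.astype(str),
    "Present Values": sample.count(),
    "Missing Values": (sample.isna() | sample.isin([" ", "**"])).sum(),
    "Unique Values": sample.nunique(),
})

print(report)
```

Printing `report` (or `data_quality_report` in the notebook) shows all four metrics side by side, one row per column.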


Use the describe function to generate summary stats for the numerical columns of saf_analysis_data (columns stored as object type, such as city and xcoord, are skipped)

In [209]:
saf_analysis_data[numerical_cols].describe() 

Unnamed: 0,year,pct,datestop,timestop,perobs,ht_feet,ht_inch,weight,age,detailCM
count,793112.0,793112.0,793112.0,793112.0,793112.0,793112.0,793112.0,793112.0,793112.0,260201.0
mean,2012.442704,67.658021,5263720.0,1406.611529,2.491459,5.187991,6.393265,169.631614,28.892837,38.964078
std,0.729981,32.669559,3320063.0,746.410826,5.479971,0.40182,3.392153,37.587987,25.860702,26.788605
min,2012.0,1.0,1012012.0,0.0,0.0,3.0,0.0,0.0,0.0,1.0
25%,2012.0,42.0,2232012.0,941.0,1.0,5.0,4.0,150.0,19.0,20.0
50%,2012.0,71.0,4262013.0,1610.0,1.0,5.0,7.0,169.0,24.0,27.0
75%,2013.0,101.0,8062012.0,2030.0,2.0,5.0,9.0,180.0,34.0,46.0
max,2015.0,123.0,12312020.0,2359.0,955.0,7.0,11.0,999.0,999.0,113.0
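The mean and quartiles for `datestop` and `timestop` are not meaningful, because those columns hold encoded dates and times rather than quantities. The sketch below converts them to a single timestamp, assuming `datestop` is MMDDYYYY with the leading zero dropped and `timestop` is HHMM on a 24-hour clock. Both formats are inferred from the values printed above, not confirmed by a data dictionary, and `stop_datetime` is a hypothetical column name.

```python
import pandas as pd

# Hypothetical stand-in using values of the kind seen in the unique-value output
sample = pd.DataFrame({
    "datestop": [1012012, 12312015],
    "timestop": [115, 2000],
})

# Restore the leading zeros that were lost when the columns were read as integers
date_text = sample["datestop"].astype(str).str.zfill(8)  # assumed MMDDYYYY
time_text = sample["timestop"].astype(str).str.zfill(4)  # assumed HHMM

# Values that do not match the assumed format become NaT instead of raising
sample["stop_datetime"] = pd.to_datetime(
    date_text + " " + time_text, format="%m%d%Y %H%M", errors="coerce"
)

print(sample)
```

With a real datetime column, `describe()` reports the earliest and latest stop instead of an average of encoded integers.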


Count the weight values equal to 999 (the maximum reported by describe) and the perobs values above 600

In [210]:
#Count the number of weight values == 999
count = (saf_data['weight'] == 999).sum()
count

580

In [211]:
#Count the number of perobs values > 600
count = (saf_data['perobs'] > 600).sum()
count

8
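Values such as a weight of 999 or an age above 100 look like placeholders or entry errors rather than measurements. One possible treatment, sketched here on made-up data, is to mask them as NaN so they no longer distort the mean and maximum. The thresholds are assumptions taken from the counts above, and `cleaned` is a hypothetical name.

```python
import pandas as pd

# Hypothetical stand-in for the weight and age columns
sample = pd.DataFrame({
    "weight": [150, 999, 180, 169],
    "age": [24, 999, 19, 34],
})

cleaned = sample.copy()

# Series.mask replaces values with NaN wherever the condition is True
cleaned["weight"] = cleaned["weight"].mask(cleaned["weight"] == 999)
cleaned["age"] = cleaned["age"].mask(cleaned["age"] > 100)

print(cleaned.describe())
```

Whether to mask, drop, or keep these rows depends on the analysis; masking preserves the other fields of each stop record.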

Transpose the results provided by describe() to make the results more readable

In [212]:
saf_analysis_data.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
year,793112.0,2012.443,0.7299806,2012.0,2012.0,2012.0,2013.0,2015.0
pct,793112.0,67.65802,32.66956,1.0,42.0,71.0,101.0,123.0
datestop,793112.0,5263720.0,3320063.0,1012012.0,2232012.0,4262013.0,8062012.0,12312015.0
timestop,793112.0,1406.612,746.4108,0.0,941.0,1610.0,2030.0,2359.0
perobs,793112.0,2.491459,5.479971,0.0,1.0,1.0,2.0,955.0
ht_feet,793112.0,5.187991,0.4018196,3.0,5.0,5.0,5.0,7.0
ht_inch,793112.0,6.393265,3.392153,0.0,4.0,7.0,9.0,11.0
weight,793112.0,169.6316,37.58799,0.0,150.0,169.0,180.0,999.0
age,793112.0,28.89284,25.8607,0.0,19.0,24.0,34.0,999.0
detailCM,260201.0,38.96408,26.78861,1.0,20.0,27.0,46.0,113.0


Use the describe() dataframe method of saf_analysis_data and instruct it to include the object type columns

In [213]:
saf_analysis_data.describe(include=['object'])

Unnamed: 0,city,xcoord,ycoord,detailcm,arstmade,sumissue,frisked,searched,pistol,riflshot,asltweap,knifcuti,machgun,othrweap,crimsusp,sex,race,inout,trhsloc
count,793112,793112.0,793112.0,532911,793112,793112,793112,793112,793112,793112,793112,793112,793112,793112,793108,793112,793112,793112,793112
unique,7,92276.0,108051.0,185,2,2,2,2,4,4,4,4,3,4,9219,3,8,2,3
top,BROOKLYN,,,20,N,N,Y,N,N,N,N,N,N,N,FEL,M,B,O,P
freq,280831,23340.0,23340.0,84425,734488,756943,454485,719008,780917,782218,782186,773321,782246,779540,154166,727200,424947,621007,618309


Show the mode of each column in saf_analysis_data - transpose and print

In [214]:
saf_analysis_data.mode().transpose()

Unnamed: 0,0
year,2012
pct,75
datestop,2102012
timestop,2130
perobs,1.0
ht_feet,5
ht_inch,8
city,BROOKLYN
weight,160
xcoord,


# SECTION 2: Crime data

Import data into pandas dataframe

In [215]:
#Load Crime data file
crime_data = pd.read_csv('C:\\Users\\mekna\\OneDrive\\Documents\\PSU\\DAAN_822\\Project\\Data_Warehouse\\NYPD_Crime_Data.csv')

First check if any rows in the crime_data dataset are not unique (look for duplicates)

In [216]:
# Code ref: Gemini AI
# Check for duplicate rows
unique_rows = not crime_data.duplicated().any()

if unique_rows:
    print("No duplicates present")
else:
    print("Duplicates present")

Duplicates present


In [217]:
duplicates = crime_data.duplicated()
duplicates

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
30      False
31      False
32      False
33      False
34      False
35      False
36      False
37      False
38      False
39      False
40      False
41      False
42      False
43      False
44      False
45      False
46      False
47      False
48      False
49      False
50      False
51      False
52      False
53      False
54      False
55      False
56      False
57      False
58      False
59      False
60      False
61      False
62      False
63      False
64      False
65      False
66      False
67      False
68      False
69      False
70      False
71    
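Printing the full boolean Series makes the duplicated rows hard to spot. A shorter check, sketched below on a made-up dataframe, counts the duplicates and displays only the affected rows. `sample` is hypothetical and much smaller than `crime_data`.

```python
import pandas as pd

# Hypothetical stand-in for crime_data with one repeated row
sample = pd.DataFrame({
    "Year": [2012, 2012, 2012, 2013],
    "Category": ["BLACK", "BLACK", "WHITE", "BLACK"],
    "Value": [0.365, 0.365, 0.289, 0.365],
})

# Number of rows that repeat an earlier row
n_duplicates = int(sample.duplicated().sum())

# keep=False flags every member of a duplicated group, including the first one
duplicate_rows = sample[sample.duplicated(keep=False)]

print(n_duplicates)
print(duplicate_rows)
```

Reviewing `duplicate_rows` before calling `drop_duplicates()` helps confirm whether the repeats are true duplicates or distinct records that happen to share values.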

Create and print dataframe to verify names of columns in crime_data

In [218]:
columns = pd.DataFrame(list(crime_data.columns.values))

columns

Unnamed: 0,0
0,Year
1,Reason
2,Status
3,Category
4,Value


Create and print dataframe to show data type of each column in crime_data

In [219]:
data_types = pd.DataFrame(crime_data.dtypes,columns=['Data Type']) 

data_types 

Unnamed: 0,Data Type
Year,int64
Reason,object
Status,object
Category,object
Value,float64


Manual inspection reveals no missing values in crime_data

In [220]:
crime_data

Unnamed: 0,Year,Reason,Status,Category,Value
0,2012,Misdemeanor Criminal Mischief,Victim,AMER IND,0.007
1,2012,Misdemeanor Criminal Mischief,Victim,ASIAN/PAC.ISL,0.084
2,2012,Misdemeanor Criminal Mischief,Victim,BLACK,0.365
3,2012,Misdemeanor Criminal Mischief,Victim,WHITE,0.289
4,2012,Misdemeanor Criminal Mischief,Victim,HISPANIC,0.254
5,2012,Misdemeanor Criminal Mischief,Victim,Total Victims/Suspects/Arrests,40985.0
6,2012,Misdemeanor Criminal Mischief,Victim,Known Race Ethnicity,25282.0
7,2012,Misdemeanor Criminal Mischief,Victim,% of Incidents With Race/Eth. Known,0.617
8,2012,Misdemeanor Criminal Mischief,Suspect,AMER IND,0.003
9,2012,Misdemeanor Criminal Mischief,Suspect,ASIAN/PAC.ISL,0.032


Create and print dataframe with the count of present values in each column of crime_data

In [221]:
present_data_counts = pd.DataFrame(crime_data.count(), columns=['Present Values']) 

present_data_counts 

Unnamed: 0,Present Values
Year,1440
Reason,1440
Status,1440
Category,1440
Value,1440


Create and print a dataframe with the count of unique values in each column of crime_data

In [222]:
unique_value_counts = pd.DataFrame(columns=['Unique Values']) 

for v in list(crime_data.columns.values): 
    unique_value_counts.loc[v] = [crime_data[v].nunique()] 

unique_value_counts 

Unnamed: 0,Unique Values
Year,4
Reason,15
Status,10
Category,13
Value,820


Print the unique values for each column in crime_data to find other missing / incorrect data

In [223]:
pd.set_option('display.max_columns', None)

# Check columns for unique values
for col in crime_data:
    print(f"\nUnique values in {col}: {crime_data[col].unique()}")


Unique values in Year: [2012 2013 2014 2015]

Unique values in Reason: ['Misdemeanor Criminal Mischief' 'Murder and Non-Negligent Manslaughter'
 'Rape ' 'Other Felony Sex Crimes' 'Robbery' 'Felonious Assault'
 'Grand Larceny' 'Misdemeanor Sex Crimes'
 'Misdemeanor Assault and Related Offenses' 'Petit Larceny'
 'Shootings (any crime where victim struck with bullet) '
 'Firearm Arrests (satisfying specific selection criteria)'
 'Proactive Offenses (Drugs) Arrests & Allegations'
 'Proactive Offenses (Property)'
 'Race/Ethnicity of Felony and Misdemeanor Juvenile Victim, Suspects and Arrestees  ']

Unique values in Status: ['Victim' 'Suspect' 'Arrestee' 'Felony' 'Misdemeanor' 'Suspects'
 'Property (Fel.)' 'Property (Misd.)' 'Victims' 'Arrestees']

Unique values in Category: ['AMER IND' 'ASIAN/PAC.ISL' 'BLACK' 'WHITE' 'HISPANIC'
 'Total Victims/Suspects/Arrests' 'Known Race Ethnicity'
 '% of Incidents With Race/Eth. Known' 'Total Arrests'
 'Total Arrests/Allegations' ' HISPANIC' 'Known Rac
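The unique values above include labels that differ only by surrounding whitespace, for example `'Rape '` and `' HISPANIC'` next to `'HISPANIC'`. A minimal sketch of normalizing them with `str.strip()` follows; `sample` and `TEXT_COLS` are hypothetical names, and the column list assumes the three text columns reported for `crime_data`.

```python
import pandas as pd

# Hypothetical stand-in for crime_data with the whitespace issues seen above
sample = pd.DataFrame({
    "Reason": ["Rape ", "Robbery"],
    "Status": ["Victim", "Victim"],
    "Category": [" HISPANIC", "HISPANIC"],
})

TEXT_COLS = ["Reason", "Status", "Category"]

cleaned = sample.copy()
for col in TEXT_COLS:
    # Remove leading and trailing whitespace from every label
    cleaned[col] = cleaned[col].str.strip()

print(cleaned["Category"].unique())
```

Stripping does not reconcile labels that differ in wording, such as `'Suspect'` and `'Suspects'`; those would need an explicit mapping.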

Print report on crime_data

In [224]:
print("\nData Quality Report") 

print("Total records: {}".format(len(crime_data.index)))


Data Quality Report
Total records: 1440


Use the describe function to generate summary stats for the entire crime_data dataset

In [225]:
crime_data.describe() 

Unnamed: 0,Year,Value
count,1440.0,1440.0
mean,2013.5,3499.715562
std,1.118422,10285.914657
min,2012.0,0.0
25%,2012.75,0.06375
50%,2013.5,0.3775
75%,2014.25,61.25
max,2015.0,85066.0


Transpose the results provided by describe() to make the results more readable

In [226]:
crime_data.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Year,1440.0,2013.5,1.118422,2012.0,2012.75,2013.5,2014.25,2015.0
Value,1440.0,3499.715562,10285.914657,0.0,0.06375,0.3775,61.25,85066.0


Use the describe() dataframe method of crime_data and instruct it to include the object type columns

In [227]:
crime_data.describe(include=['object'])

Unnamed: 0,Reason,Status,Category
count,1440,1440,1440
unique,15,10,13
top,Misdemeanor Criminal Mischief,Arrestee,AMER IND
freq,192,416,180


Show the mode of each column in crime_data - transpose and print

In [228]:
crime_data.mode().transpose()

Unnamed: 0,0,1,2,3
Year,2012,2013,2014,2015
Reason,Misdemeanor Criminal Mischief,,,
Status,Arrestee,,,
Category,AMER IND,ASIAN/PAC.ISL,BLACK,WHITE
Value,0.003,,,


# SECTION 3: Population data

Import data into pandas dataframe

In [229]:
#Load Population data file
census_data = pd.read_csv('C:\\Users\\mekna\\OneDrive\\Documents\\PSU\\DAAN_822\\Project\\Data_Warehouse\\NYC_Census_Data.csv')

First check if any rows in the census_data dataset are not unique (look for duplicates)

In [230]:
# Code ref: Gemini AI
# Check for duplicate rows
unique_rows = not census_data.duplicated().any()

if unique_rows:
    print("No duplicates present")
else:
    print("Duplicates present")

No duplicates present


Create and print dataframe to verify names of columns in census_data

In [231]:
columns = pd.DataFrame(list(census_data.columns.values))

columns

Unnamed: 0,0
0,YEAR
1,BRONX TOTAL
2,BRONX WHITE
3,BRONX BLACK
4,BRONX ASIAN
5,BRONX NATIVE
6,BRONX PAC ISLANDER
7,BRONX OTHER
8,BRONX TOTAL OTHER
9,BRONX MULTI


Create and print dataframe to show data type of each column in census_data

In [232]:
data_types = pd.DataFrame(census_data.dtypes,columns=['Data Type']) 

data_types 

Unnamed: 0,Data Type
YEAR,int64
BRONX TOTAL,float64
BRONX WHITE,float64
BRONX BLACK,float64
BRONX ASIAN,float64
BRONX NATIVE,float64
BRONX PAC ISLANDER,float64
BRONX OTHER,float64
BRONX TOTAL OTHER,float64
BRONX MULTI,float64


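Manual inspection of census_data shows that the NATIVE, PAC ISLANDER, and OTHER columns are populated only for 2010; the remaining columns have no missing values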

In [233]:
census_data

Unnamed: 0,YEAR,BRONX TOTAL,BRONX WHITE,BRONX BLACK,BRONX ASIAN,BRONX NATIVE,BRONX PAC ISLANDER,BRONX OTHER,BRONX TOTAL OTHER,BRONX MULTI,BRONX HISPANIC,BROOKLYN TOTAL,BROOKLYN WHITE,BROOKLYN BLACK,BROOKLYN ASIAN,BROOKLYN NATIVE,BROOKLYN PAC ISLANDER,BROOKLYN OTHER,BROOKLYN TOTAL OTHER,BROOKLYN MULTI,BROOKLYN HISPANIC,MANHATTAN TOTAL,MANHATTAN WHITE,MANHATTAN BLACK,MANHATTAN ASIAN,MANHATTAN NATIVE,MANHATTAN PAC ISLANDER,MANHATTAN OTHER,MANHATTAN TOTAL OTHER,MANHATTAN MULTI,MANHATTAN HISPANIC,QUEENS TOTAL,QUEENS WHITE,QUEENS BLACK,QUEENS ASIAN,QUEENS NATIVE,QUEENS PAC ISLANDER,QUEENS OTHER,QUEENS TOTAL OTHER,QUEENS MULTI,QUEENS HISPANIC,STATEN ISLAND TOTAL,STATEN ISLAND WHITE,STATEN ISLAND BLACK,STATEN ISLAND ASIAN,STATEN ISLAND NATIVE,STATEN ISLAND PAC ISLANDER,STATEN ISLAND OTHER,STATEN ISLAND TOTAL OTHER,STATEN ISLAND MULTI,STATEN ISLAND HISPANIC
0,2010,1385108.0,151209.0,416695.0,47335.0,3460.0,398.0,8636.0,12494.0,15962.0,741413.0,2504700.0,893306.0,799066.0,260129.0,4638.0,633.0,10633.0,15904.0,40010.0,496285.0,1585873.0,761493.0,205340.0,177624.0,2144.0,533.0,5205.0,7882.0,29957.0,403577.0,2230722.0,616727.0,395881.0,508334.0,6490.0,1094.0,32339.0,39923.0,56107.0,613750.0,468730.0,300169.0,44313.0,34697.0,695.0,137.0,1028.0,1860.0,6640.0,81051.0
1,2011,1393862.6,149167.7,416964.8,49378.1,,,,13231.2,17202.8,747918.0,2527837.4,900818.1,792129.0,271193.7,,,,18071.5,47326.0,498299.1,1596710.8,764673.1,204765.2,181824.0,,,,8705.0,33260.2,403483.3,2248196.2,609990.1,394430.4,523158.9,,,,42548.2,58907.5,619161.1,471431.7,297950.2,44565.2,37102.6,,,,2064.0,7107.8,82641.9
2,2012,1402617.2,147126.4,417234.6,51421.2,,,,13968.4,18443.6,754423.0,2550974.8,908330.2,785192.0,282258.4,,,,20239.0,54642.0,500313.2,1607548.6,767853.2,204190.4,186024.0,,,,9528.0,36563.4,403389.6,2265670.4,603253.2,392979.8,537983.8,,,,45173.4,61708.0,624572.2,474133.4,295731.4,44817.4,39508.2,,,,2268.0,7575.6,84232.8
3,2013,1411371.8,145085.1,417504.4,53464.3,,,,14705.6,19684.4,760928.0,2574112.2,915842.3,778255.0,293323.1,,,,22406.5,61958.0,502327.3,1618386.4,771033.3,203615.6,190224.0,,,,10351.0,39866.6,403295.9,2283144.6,596516.3,391529.2,552808.7,,,,47798.6,64508.5,629983.3,476835.1,293512.6,45069.6,41913.8,,,,2472.0,8043.4,85823.7
4,2014,1420126.4,143043.8,417774.2,55507.4,,,,15442.8,20925.2,767433.0,2597249.6,923354.4,771318.0,304387.8,,,,24574.0,69274.0,504341.4,1629224.2,774213.4,203040.8,194424.0,,,,11174.0,43169.8,403202.2,2300618.8,589779.4,390078.6,567633.6,,,,50423.8,67309.0,635394.4,479536.8,291293.8,45321.8,44319.4,,,,2676.0,8511.2,87414.6
5,2015,1428881.0,141002.5,418044.0,57550.5,,,,16180.0,22166.0,773938.0,2620387.0,930866.5,764381.0,315452.5,,,,26741.5,76590.0,506355.5,1640062.0,777393.5,202466.0,198624.0,,,,11997.0,46473.0,403108.5,2318093.0,583042.5,388628.0,582458.5,,,,53049.0,70109.5,640805.5,482238.5,289075.0,45574.0,46725.0,,,,2880.0,8979.0,89005.5
6,2016,1437635.6,138961.2,418313.8,59593.6,,,,16917.2,23406.8,780443.0,2643524.4,938378.6,757444.0,326517.2,,,,28909.0,83906.0,508369.6,1650899.8,780573.6,201891.2,202824.0,,,,12820.0,49776.2,403014.8,2335567.2,576305.6,387177.4,597283.4,,,,55674.2,72910.0,646216.6,484940.2,286856.2,45826.2,49130.6,,,,3084.0,9446.8,90596.4
7,2017,1446390.2,136919.9,418583.6,61636.7,,,,17654.4,24647.6,786948.0,2666661.8,945890.7,750507.0,337581.9,,,,31076.5,91222.0,510383.7,1661737.6,783753.7,201316.4,207024.0,,,,13643.0,53079.4,402921.1,2353041.4,569568.7,385726.8,612108.3,,,,58299.4,75710.5,651627.7,487641.9,284637.4,46078.4,51536.2,,,,3288.0,9914.6,92187.3
8,2018,1455144.8,134878.6,418853.4,63679.8,,,,18391.6,25888.4,793453.0,2689799.2,953402.8,743570.0,348646.6,,,,33244.0,98538.0,512397.8,1672575.4,786933.8,200741.6,211224.0,,,,14466.0,56382.6,402827.4,2370515.6,562831.8,384276.2,626933.2,,,,60924.6,78511.0,657038.8,490343.6,282418.6,46330.6,53941.8,,,,3492.0,10382.4,93778.2
9,2019,1463899.4,132837.3,419123.2,65722.9,,,,19128.8,27129.2,799958.0,2712936.6,960914.9,736633.0,359711.3,,,,35411.5,105854.0,514411.9,1683413.2,790113.9,200166.8,215424.0,,,,15289.0,59685.8,402733.7,2387989.8,556094.9,382825.6,641758.1,,,,63549.8,81311.5,662449.9,493045.3,280199.8,46582.8,56347.4,,,,3696.0,10850.2,95369.1


Create and print dataframe with the count of present values in each column of census_data

In [234]:
present_data_counts = pd.DataFrame(census_data.count(), columns=['Present Values']) 

present_data_counts 

Unnamed: 0,Present Values
YEAR,11
BRONX TOTAL,11
BRONX WHITE,11
BRONX BLACK,11
BRONX ASIAN,11
BRONX NATIVE,1
BRONX PAC ISLANDER,1
BRONX OTHER,1
BRONX TOTAL OTHER,11
BRONX MULTI,11
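Columns with a present-value count below the number of rows contain missing values. A short sketch for listing only those columns is shown below on a made-up three-row extract; `sample` and `incomplete_cols` are hypothetical names.

```python
import pandas as pd

# Hypothetical three-row extract with the same pattern as census_data
sample = pd.DataFrame({
    "YEAR": [2010, 2011, 2012],
    "BRONX TOTAL": [1385108.0, 1393862.6, 1402617.2],
    "BRONX NATIVE": [3460.0, None, None],
})

# Count nulls per column and keep only the columns that have any
missing = sample.isna().sum()
incomplete_cols = missing[missing > 0]

print(incomplete_cols)
```

Filtering the counts this way scales to wide tables where a visual scan of every column is impractical.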


Create and print a dataframe with the count of unique values in each column of census_data

In [235]:
unique_value_counts = pd.DataFrame(columns=['Unique Values']) 

for v in list(census_data.columns.values): 
    unique_value_counts.loc[v] = [census_data[v].nunique()] 

unique_value_counts 

Unnamed: 0,Unique Values
YEAR,11
BRONX TOTAL,11
BRONX WHITE,11
BRONX BLACK,11
BRONX ASIAN,11
BRONX NATIVE,1
BRONX PAC ISLANDER,1
BRONX OTHER,1
BRONX TOTAL OTHER,11
BRONX MULTI,11


Print the unique values for each column in census_data to find other missing / incorrect data

In [236]:
pd.set_option('display.max_columns', None)

# Check columns for unique values
for col in census_data:
    print(f"\nUnique values in {col}: {census_data[col].unique()}")


Unique values in YEAR: [2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020]

Unique values in BRONX TOTAL: [1385108.  1393862.6 1402617.2 1411371.8 1420126.4 1428881.  1437635.6
 1446390.2 1455144.8 1463899.4 1472654. ]

Unique values in BRONX WHITE: [151209.  149167.7 147126.4 145085.1 143043.8 141002.5 138961.2 136919.9
 134878.6 132837.3 130796. ]

Unique values in BRONX BLACK: [416695.  416964.8 417234.6 417504.4 417774.2 418044.  418313.8 418583.6
 418853.4 419123.2 419393. ]

Unique values in BRONX ASIAN: [47335.  49378.1 51421.2 53464.3 55507.4 57550.5 59593.6 61636.7 63679.8
 65722.9 67766. ]

Unique values in BRONX NATIVE: [3460.   nan]

Unique values in BRONX PAC ISLANDER: [398.  nan]

Unique values in BRONX OTHER: [8636.   nan]

Unique values in BRONX TOTAL OTHER: [12494.  13231.2 13968.4 14705.6 15442.8 16180.  16917.2 17654.4 18391.6
 19128.8 19866. ]

Unique values in BRONX MULTI: [15962.  17202.8 18443.6 19684.4 20925.2 22166.  23406.8 24647.6 25888.4
 27129.2 28370
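The values listed above rise or fall by the same amount each year, which suggests the years between 2010 and 2020 may be linear interpolations rather than annual estimates. That reading is an inference from the printed values, not something the source states. The sketch below checks one column for a constant year-over-year step; `sample` is a hypothetical extract.

```python
import pandas as pd

# Hypothetical extract of BRONX TOTAL for the first four years shown above
sample = pd.DataFrame({
    "YEAR": [2010, 2011, 2012, 2013],
    "BRONX TOTAL": [1385108.0, 1393862.6, 1402617.2, 1411371.8],
})

# Year-over-year differences, rounded to remove floating-point noise
steps = sample["BRONX TOTAL"].diff().dropna().round(1)

# A single distinct step is consistent with linear interpolation
is_constant_step = steps.nunique() == 1

print(steps.unique(), is_constant_step)
```

If the step is constant, the interpolated years carry no independent information, which matters when computing per-capita stop rates by year.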

Print report on census_data

In [237]:
print("\nData Quality Report") 

print("Total records: {}".format(len(census_data.index)))


Data Quality Report
Total records: 11


Use the describe function to generate summary stats for the entire census_data dataset

In [238]:
census_data.describe() 

Unnamed: 0,YEAR,BRONX TOTAL,BRONX WHITE,BRONX BLACK,BRONX ASIAN,BRONX NATIVE,BRONX PAC ISLANDER,BRONX OTHER,BRONX TOTAL OTHER,BRONX MULTI,BRONX HISPANIC,BROOKLYN TOTAL,BROOKLYN WHITE,BROOKLYN BLACK,BROOKLYN ASIAN,BROOKLYN NATIVE,BROOKLYN PAC ISLANDER,BROOKLYN OTHER,BROOKLYN TOTAL OTHER,BROOKLYN MULTI,BROOKLYN HISPANIC,MANHATTAN TOTAL,MANHATTAN WHITE,MANHATTAN BLACK,MANHATTAN ASIAN,MANHATTAN NATIVE,MANHATTAN PAC ISLANDER,MANHATTAN OTHER,MANHATTAN TOTAL OTHER,MANHATTAN MULTI,MANHATTAN HISPANIC,QUEENS TOTAL,QUEENS WHITE,QUEENS BLACK,QUEENS ASIAN,QUEENS NATIVE,QUEENS PAC ISLANDER,QUEENS OTHER,QUEENS TOTAL OTHER,QUEENS MULTI,QUEENS HISPANIC,STATEN ISLAND TOTAL,STATEN ISLAND WHITE,STATEN ISLAND BLACK,STATEN ISLAND ASIAN,STATEN ISLAND NATIVE,STATEN ISLAND PAC ISLANDER,STATEN ISLAND OTHER,STATEN ISLAND TOTAL OTHER,STATEN ISLAND MULTI,STATEN ISLAND HISPANIC
count,11.0,11.0,11.0,11.0,11.0,1.0,1.0,1.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,1.0,1.0,1.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,1.0,1.0,1.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,1.0,1.0,1.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,1.0,1.0,1.0,11.0,11.0,11.0
mean,2015.0,1428881.0,141002.5,418044.0,57550.5,3460.0,398.0,8636.0,16180.0,22166.0,773938.0,2620387.0,930866.5,764381.0,315452.5,4638.0,633.0,10633.0,26741.5,76590.0,506355.5,1640062.0,777393.5,202466.0,198624.0,2144.0,533.0,5205.0,11997.0,46473.0,403108.5,2318093.0,583042.5,388628.0,582458.5,6490.0,1094.0,32339.0,53049.0,70109.5,640805.5,482238.5,289075.0,45574.0,46725.0,695.0,137.0,1028.0,2880.0,8979.0,89005.5
std,3.316625,29035.72,6770.226185,894.825368,6776.196109,,,,2445.015795,4115.26804,21574.644261,76738.07,24914.817088,23007.426171,36697.458318,,,,7188.784233,24264.426966,6680.01399,35944.92,10547.198496,1906.395929,13929.824119,,,,2729.582202,10955.475008,310.767743,57955.36,22343.76955,4811.095921,49168.630855,,,,8706.8034,9288.207725,17946.588403,8960.525196,7358.927085,836.452772,7978.472596,,,,676.591457,1551.517077,5276.418379
min,2010.0,1385108.0,130796.0,416695.0,47335.0,3460.0,398.0,8636.0,12494.0,15962.0,741413.0,2504700.0,893306.0,729696.0,260129.0,4638.0,633.0,10633.0,15904.0,40010.0,496285.0,1585873.0,761493.0,199592.0,177624.0,2144.0,533.0,5205.0,7882.0,29957.0,402640.0,2230722.0,549358.0,381375.0,508334.0,6490.0,1094.0,32339.0,39923.0,56107.0,613750.0,468730.0,277981.0,44313.0,34697.0,695.0,137.0,1028.0,1860.0,6640.0,81051.0
25%,2012.5,1406994.0,135899.25,417369.5,52442.75,3460.0,398.0,8636.0,14337.0,19064.0,757675.5,2562544.0,912086.25,747038.5,287790.75,4638.0,633.0,10633.0,21322.75,58300.0,501320.25,1612968.0,769443.25,201029.0,188124.0,2144.0,533.0,5205.0,9939.5,38215.0,402874.25,2274408.0,566200.25,385001.5,545396.25,6490.0,1094.0,32339.0,46486.0,63108.25,627277.75,475484.25,283528.0,44943.5,40711.0,695.0,137.0,1028.0,2370.0,7809.5,85028.25
50%,2015.0,1428881.0,141002.5,418044.0,57550.5,3460.0,398.0,8636.0,16180.0,22166.0,773938.0,2620387.0,930866.5,764381.0,315452.5,4638.0,633.0,10633.0,26741.5,76590.0,506355.5,1640062.0,777393.5,202466.0,198624.0,2144.0,533.0,5205.0,11997.0,46473.0,403108.5,2318093.0,583042.5,388628.0,582458.5,6490.0,1094.0,32339.0,53049.0,70109.5,640805.5,482238.5,289075.0,45574.0,46725.0,695.0,137.0,1028.0,2880.0,8979.0,89005.5
75%,2017.5,1450768.0,146105.75,418718.5,62658.25,3460.0,398.0,8636.0,18023.0,25268.0,790200.5,2678230.0,949646.75,781723.5,343114.25,4638.0,633.0,10633.0,32160.25,94880.0,511390.75,1667156.0,785343.75,203903.0,209124.0,2144.0,533.0,5205.0,14054.5,54731.0,403342.75,2361778.0,599884.75,392254.5,619520.75,6490.0,1094.0,32339.0,59612.0,77110.75,654333.25,488992.75,294622.0,46204.5,52739.0,695.0,137.0,1028.0,3390.0,10148.5,92982.75
max,2020.0,1472654.0,151209.0,419393.0,67766.0,3460.0,398.0,8636.0,19866.0,28370.0,806463.0,2736074.0,968427.0,799066.0,370776.0,4638.0,633.0,10633.0,37579.0,113170.0,516426.0,1694251.0,793294.0,205340.0,219624.0,2144.0,533.0,5205.0,16112.0,62989.0,403577.0,2405464.0,616727.0,395881.0,656583.0,6490.0,1094.0,32339.0,66175.0,84112.0,667861.0,495747.0,300169.0,46835.0,58753.0,695.0,137.0,1028.0,3900.0,11318.0,96960.0


Transpose the results provided by describe() to make the results more readable

In [239]:
census_data.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
YEAR,11.0,2015.0,3.316625,2010.0,2012.5,2015.0,2017.5,2020.0
BRONX TOTAL,11.0,1428881.0,29035.72339,1385108.0,1406994.5,1428881.0,1450767.5,1472654.0
BRONX WHITE,11.0,141002.5,6770.226185,130796.0,135899.25,141002.5,146105.75,151209.0
BRONX BLACK,11.0,418044.0,894.825368,416695.0,417369.5,418044.0,418718.5,419393.0
BRONX ASIAN,11.0,57550.5,6776.196109,47335.0,52442.75,57550.5,62658.25,67766.0
BRONX NATIVE,1.0,3460.0,,3460.0,3460.0,3460.0,3460.0,3460.0
BRONX PAC ISLANDER,1.0,398.0,,398.0,398.0,398.0,398.0,398.0
BRONX OTHER,1.0,8636.0,,8636.0,8636.0,8636.0,8636.0,8636.0
BRONX TOTAL OTHER,11.0,16180.0,2445.015795,12494.0,14337.0,16180.0,18023.0,19866.0
BRONX MULTI,11.0,22166.0,4115.26804,15962.0,19064.0,22166.0,25268.0,28370.0


Show the mode of each column in census_data - transpose and print

In [240]:
census_data.mode().transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
YEAR,2010.0,2011.0,2012.0,2013.0,2014.0,2015.0,2016.0,2017.0,2018.0,2019.0,2020.0
BRONX TOTAL,1385108.0,1393862.6,1402617.2,1411371.8,1420126.4,1428881.0,1437635.6,1446390.2,1455144.8,1463899.4,1472654.0
BRONX WHITE,130796.0,132837.3,134878.6,136919.9,138961.2,141002.5,143043.8,145085.1,147126.4,149167.7,151209.0
BRONX BLACK,416695.0,416964.8,417234.6,417504.4,417774.2,418044.0,418313.8,418583.6,418853.4,419123.2,419393.0
BRONX ASIAN,47335.0,49378.1,51421.2,53464.3,55507.4,57550.5,59593.6,61636.7,63679.8,65722.9,67766.0
BRONX NATIVE,3460.0,,,,,,,,,,
BRONX PAC ISLANDER,398.0,,,,,,,,,,
BRONX OTHER,8636.0,,,,,,,,,,
BRONX TOTAL OTHER,12494.0,13231.2,13968.4,14705.6,15442.8,16180.0,16917.2,17654.4,18391.6,19128.8,19866.0
BRONX MULTI,15962.0,17202.8,18443.6,19684.4,20925.2,22166.0,23406.8,24647.6,25888.4,27129.2,28370.0
