# And now introducing Pandas...

- Named after a cute animal? Not quite: the name comes from "panel data".
- Open source (free) Python library.
- One of the most widely used Python packages for data analysis.
- Useful beyond this class: also used to prepare data for AI and machine learning tasks.

## Pandas in action


### **<a href="https://www.bloomberg.com/graphics/2022-wells-fargo-black-home-loan-refinancing/">Wells Fargo Rejected Half Its Black Applicants in Mortgage Refinancing Boom</a>**

To report this investigation, we had to scour 8 million records obtained through the Home Mortgage Disclosure Act, for completed applications to refinance conventional loans in 2020. The size of the dataset, the convoluted racial breakdown categories, the types of loans etc were overwhelming. Using Python was the only way to categorize the data to find that:

> While Black applicants had lower approval rates than White ones at all major lenders, the data show, Wells Fargo had the biggest disparity and was alone in rejecting more Black homeowners than it accepted...Wells Fargo, which declined to comment about individual customers, didn’t dispute Bloomberg’s statistical findings.

**-- Ann Choi**, Data Investigative Reporter @Bloomberg










# Import & Explore

Before we can analyze data, we have to learn how to:

- import different types of spreadsheet files,
- get a quick sense of what our data holds,
- explore different sections of our data in Pandas context.

## Import Pandas

Unlike the pure Python we've written so far, Pandas is a separate library. We have to import it, which makes all of its functionality available in our IPython notebook (```.ipynb```).


In [1]:
## import pandas
import pandas as pd

## Import CSV data into our notebook

Download <a href="https://drive.google.com/file/d/1ONMD9surl8Ulu-BKCepe4V1Lt_JHIyaP/view?usp=sharing">this data folder</a> and <a href="https://drive.google.com/file/d/1Y_yAzd2g_0pqh4jvGV0zGPaDVQosZBIv/view?usp=sharing">this file</a>.

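Place them where the code below expects them. Most cells read from a ```raw_data``` folder that sits next to this notebook, and one cell reads ```insurance.csv``` from the notebook's own folder. If the files are anywhere else, the reads will fail.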

### Read a CSV file into your notebook

#### Bring ```insurance.csv``` into Pandas

- You must provide a ```path``` to the file you are importing.

Syntax: ```pd.read_csv("path_to_file")```

In [2]:
## simply read the csv file when it sits in the same folder as the notebook
pd.read_csv("insurance.csv")

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [3]:
## read the csv file when it sits in a different folder
## provide the path, relative to the notebook

pd.read_csv("raw_data/insurance.csv")

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500
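
The two reads above differ only in the path they pass. As a rough sketch of how that works, the snippet below builds a throwaway folder with a ```raw_data``` subfolder and a tiny CSV, then reads it back. The file name ```sample.csv```, the variable names and the two rows are assumptions made up for illustration, not the real insurance data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a temporary folder that mimics this notebook's layout:
# a working folder that contains a raw_data subfolder.
workdir = Path(tempfile.mkdtemp())
(workdir / "raw_data").mkdir()

# Hypothetical two-row sample, written to raw_data/sample.csv.
sample = "age,sex,charges\n19,female,16884.924\n18,male,1725.5523\n"
csv_path = workdir / "raw_data" / "sample.csv"
csv_path.write_text(sample)

# read_csv accepts a string path or a pathlib.Path.
df_sample = pd.read_csv(csv_path)
print(df_sample)
```

If the path does not point to an existing file, ```pd.read_csv``` raises a ```FileNotFoundError```, which is usually the first thing to check when an import fails.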


### Read and hold CSV content into a ```dataframe```

By convention, the variable that holds a dataframe is called ```df```.

In [4]:
## read and store csv in a dataframe

df = pd.read_csv("raw_data/insurance.csv")

## Explore our dataframe

In [5]:
## call the df
## returns the first 5 and last 5 rows
df

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [6]:
## call the top only
## returns first 5
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [7]:
## call the top n rows
df.head(12)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552
5,31,female,25.74,0,no,southeast,3756.6216
6,46,female,33.44,1,no,southeast,8240.5896
7,37,female,27.74,3,no,northwest,7281.5056
8,37,male,29.83,2,no,northeast,6406.4107
9,60,female,25.84,0,no,northwest,28923.13692


In [8]:
## call the last 5
df.tail()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1333,50,male,30.97,3,no,northwest,10600.5483
1334,18,female,31.92,0,no,northeast,2205.9808
1335,18,female,36.85,0,no,southeast,1629.8335
1336,21,female,25.8,0,no,southwest,2007.945
1337,61,female,29.07,0,yes,northwest,29141.3603


In [9]:
## call the last n
df.tail(9)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1329,52,male,38.6,2,no,southwest,10325.206
1330,57,female,25.74,2,no,southeast,12629.1656
1331,23,female,33.4,0,no,southwest,10795.93733
1332,52,female,44.7,3,no,southwest,11411.685
1333,50,male,30.97,3,no,northwest,10600.5483
1334,18,female,31.92,0,no,northeast,2205.9808
1335,18,female,36.85,0,no,southeast,1629.8335
1336,21,female,25.8,0,no,southwest,2007.945
1337,61,female,29.07,0,yes,northwest,29141.3603
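
To see ```head()``` and ```tail()``` side by side, here is a minimal sketch on a made-up 8-row dataframe (```df_demo``` and its column ```n``` are illustrative names, not part of our data).

```python
import pandas as pd

# Hypothetical 8-row frame, used only to illustrate head() and tail().
df_demo = pd.DataFrame({"n": range(8)})

first_three = df_demo.head(3)   # the first 3 rows
last_two = df_demo.tail(2)      # the last 2 rows
everything = df_demo.head(100)  # asking for more rows than exist returns them all

print(first_three.index.tolist())
print(last_two.index.tolist())
print(len(everything))
```

Both methods return a new dataframe and leave the original untouched.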


In [10]:
## call a random sample of n rows
df.sample(20)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
491,61,female,25.08,0,no,southeast,24513.09126
775,51,male,33.33,3,no,southeast,10560.4917
327,45,male,36.48,2,yes,northwest,42760.5022
764,45,female,25.175,2,no,northeast,9095.06825
456,55,female,30.14,2,no,southeast,11881.9696
376,39,female,24.89,3,yes,northeast,21659.9301
44,38,male,37.05,1,no,northeast,6079.6715
1027,23,male,18.715,0,no,northwest,21595.38229
1021,22,female,31.02,3,yes,southeast,35595.5898
1179,31,male,29.81,0,yes,southeast,19350.3689
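
Because ```sample()``` picks rows at random, rerunning the cell usually shows different rows. If you need the same rows every time (to share a result with an editor, for example), one option is the ```random_state``` parameter. A small sketch on a made-up dataframe, where ```df_demo``` and the seed value 42 are arbitrary choices:

```python
import pandas as pd

# Hypothetical 100-row frame, used only to illustrate sample().
df_demo = pd.DataFrame({"n": range(100)})

# The same random_state gives the same rows on every call.
pick_a = df_demo.sample(5, random_state=42)
pick_b = df_demo.sample(5, random_state=42)

print(pick_a.equals(pick_b))
```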


In [11]:
## get basic overview of the data
# type(df)
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
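
The ```Non-Null Count``` and ```Dtype``` columns that ```info()``` prints can also be pulled out as values you can work with. This is a sketch on a made-up dataframe with one missing value (all names and values here are illustrative):

```python
import pandas as pd

# Hypothetical frame with one missing region.
df_demo = pd.DataFrame({
    "age": [19, 18, 28],
    "region": ["southwest", None, "southeast"],
})

types = df_demo.dtypes           # one dtype per column
non_null = df_demo.count()       # non-null values per column
missing = df_demo.isna().sum()   # missing values per column

print(types)
print(non_null)
print(missing)
```

A column whose non-null count is lower than the number of rows has missing values, which is worth knowing before any analysis.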


## Importing a simple Excel file

Bring ```global_life.xlsx``` into Pandas

In [12]:
## step 1. confirm that read returns a df
## step 2. place it into a df, call the df
df = pd.read_excel("raw_data/global_life.xlsx")
df

Unnamed: 0,Country Name,Country Code
0,Aruba,ABW
1,Afghanistan,AFG
2,Angola,AGO
3,Albania,ALB
4,Andorra,AND
...,...,...
259,Kosovo,XKX
260,"Yemen, Rep.",YEM
261,South Africa,ZAF
262,Zambia,ZMB


In [13]:
## call the top 15 values
df.head(15)

Unnamed: 0,Country Name,Country Code
0,Aruba,ABW
1,Afghanistan,AFG
2,Angola,AGO
3,Albania,ALB
4,Andorra,AND
5,Arab World,ARB
6,United Arab Emirates,ARE
7,Argentina,ARG
8,Armenia,ARM
9,American Samoa,ASM


In [14]:
## call the last 3 values
df.tail(3)

Unnamed: 0,Country Name,Country Code
261,South Africa,ZAF
262,Zambia,ZMB
263,Zimbabwe,ZWE


In [15]:
## show 13 random rows
df.sample(13)

Unnamed: 0,Country Name,Country Code
118,Kazakhstan,KAZ
193,Paraguay,PRY
139,Lesotho,LSO
186,Palau,PLW
202,South Asia,SAS
49,Curacao,CUW
29,Brunei Darussalam,BRN
201,Rwanda,RWA
229,Europe & Central Asia (IDA & IBRD countries),TEC
259,Kosovo,XKX


## Read a specific sheet in an Excel file

From ```global_life.xlsx```, import only the ```global life expectancy``` sheet (targeting one specific sheet of an .xlsx file).

In [16]:
## read the excel file and provide the sheet_name parameter
## by position: sheets are numbered from 0, so 1 is the second sheet
df1 = pd.read_excel("raw_data/global_life.xlsx", sheet_name=1)
df1

Unnamed: 0,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,ABW,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,65.662,66.074,66.444,66.787,67.113,67.435,67.762,...,75.15800,75.299000,75.441000,75.583000,75.725000,75.868000,76.010000,76.152000,,
1,AFG,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,32.446,32.962,33.471,33.971,34.463,34.948,35.430,...,61.55300,62.054000,62.525000,62.966000,63.377000,63.763000,64.130000,64.486000,,
2,AGO,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,37.524,37.811,38.113,38.430,38.760,39.102,39.454,...,56.33000,57.236000,58.054000,58.776000,59.398000,59.925000,60.379000,60.782000,,
3,ALB,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,62.283,63.301,64.190,64.914,65.463,65.850,66.110,...,76.91400,77.252000,77.554000,77.813000,78.025000,78.194000,78.333000,78.458000,,
4,AND,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259,XKX,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,,,,,,,,...,70.14878,70.497561,70.797561,71.097561,71.346341,71.646341,71.946341,72.195122,,
260,YEM,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,29.919,30.163,30.500,30.943,31.501,32.175,32.960,...,65.76800,65.920000,66.016000,66.066000,66.085000,66.087000,66.086000,66.096000,,
261,ZAF,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,48.406,48.777,49.142,49.509,49.888,50.284,50.705,...,58.89500,60.060000,61.099000,61.968000,62.649000,63.153000,63.538000,63.857000,,
262,ZMB,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,46.687,47.084,47.446,47.772,48.068,48.351,48.643,...,57.12600,58.502000,59.746000,60.831000,61.737000,62.464000,63.043000,63.510000,,


In [17]:
## read the excel file and provide the sheet_name parameter
## by name: pass the sheet's name as a string
df1 = pd.read_excel("raw_data/global_life.xlsx", sheet_name="global life expectancy")
df1

Unnamed: 0,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,ABW,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,65.662,66.074,66.444,66.787,67.113,67.435,67.762,...,75.15800,75.299000,75.441000,75.583000,75.725000,75.868000,76.010000,76.152000,,
1,AFG,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,32.446,32.962,33.471,33.971,34.463,34.948,35.430,...,61.55300,62.054000,62.525000,62.966000,63.377000,63.763000,64.130000,64.486000,,
2,AGO,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,37.524,37.811,38.113,38.430,38.760,39.102,39.454,...,56.33000,57.236000,58.054000,58.776000,59.398000,59.925000,60.379000,60.782000,,
3,ALB,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,62.283,63.301,64.190,64.914,65.463,65.850,66.110,...,76.91400,77.252000,77.554000,77.813000,78.025000,78.194000,78.333000,78.458000,,
4,AND,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259,XKX,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,,,,,,,,...,70.14878,70.497561,70.797561,71.097561,71.346341,71.646341,71.946341,72.195122,,
260,YEM,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,29.919,30.163,30.500,30.943,31.501,32.175,32.960,...,65.76800,65.920000,66.016000,66.066000,66.085000,66.087000,66.086000,66.096000,,
261,ZAF,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,48.406,48.777,49.142,49.509,49.888,50.284,50.705,...,58.89500,60.060000,61.099000,61.968000,62.649000,63.153000,63.538000,63.857000,,
262,ZMB,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,46.687,47.084,47.446,47.772,48.068,48.351,48.643,...,57.12600,58.502000,59.746000,60.831000,61.737000,62.464000,63.043000,63.510000,,
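
The two cells above reach the same sheet in two ways. As a hedged sketch, the snippet below writes a small two-sheet workbook to a temporary file and reads the second sheet back by position and by name. The workbook, sheet names and column names are made up for illustration, and the code assumes an Excel engine such as ```openpyxl``` is installed alongside Pandas.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny illustrative data for two sheets; names are hypothetical.
countries = pd.DataFrame({"Country Code": ["ABW", "AFG"]})
life = pd.DataFrame({
    "Country Code": ["ABW", "AFG"],
    "life_expectancy": [76.152, 64.486],
})

# Write both sheets into one temporary workbook.
xlsx_path = Path(tempfile.mkdtemp()) / "demo.xlsx"
with pd.ExcelWriter(xlsx_path) as writer:
    countries.to_excel(writer, sheet_name="countries", index=False)
    life.to_excel(writer, sheet_name="life expectancy", index=False)

# sheet_name takes a 0-based position or the sheet's name.
by_position = pd.read_excel(xlsx_path, sheet_name=1)
by_name = pd.read_excel(xlsx_path, sheet_name="life expectancy")

print(by_position.equals(by_name))
```

Reading by name tends to be the safer choice, since it keeps working if someone reorders the sheets in the workbook.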


## Special import cases

Import the sheet ```FY18 Business``` from ```sba-disaster-loans-18.xlsx```.

In [18]:
## import here
## what do you see that's not right?
df2 = pd.read_excel("raw_data/sba-disaster-loans-18.xlsx", sheet_name="FY18 Business" )
df2

Unnamed: 0,SBA Disaster Loan Data,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Report Run Date: 3/19/2019,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14
0,,,,,,,,Report Run Time: 9:57:31AM,,,,,,,
1,Reporting Period: 10/01/2017 - 09/30/2018,,,,,,,,,,,,,,
2,Declaration Type(s): 'ALL',,,,,,,,,,,,,,
3,SBA Physical Declaration Number,SBA EIDL Declaration Number,FEMA Disaster Number,SBA Disaster Number,Damaged Property City Name,Damaged Property Zip Code,Damaged Property County/Parish Name,Damaged Property State Code,Total Verified Loss,Verified Loss Real Estate,Verified Loss Content,Total Approved Loan Amount,Approved Amount Real Estate,Approved Amount Content,Approved Amount EIDL
4,,15337,,ZZ-00013,LOUISVILLE,40299,JEFFERSON,KY,0,0,0,49400,0,0,49400
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
825,15622,15623,4382,CA-00288,SHASTA,96087,Shasta,CA,227915,198265,29650,195100,161400,29700,4000
826,15626,15627,,NE-00071,PENDER,68047,Thurston,NE,7592,7592,0,7600,7600,0,
827,15642,15643,,NM-00062,SANTA FE,87505,Santa Fe,NM,24472,22872,1600,24500,22900,1600,
828,,,,,,,,,,,,,,,


## Read a specific sheet in an Excel file but **this time skip the first few rows of formatting**

From ```sba-disaster-loans-18.xlsx```, import the ```FY18 Business``` sheet (you have to skip the first 4 rows, so that the row holding the real column names becomes the header).


In [19]:
## read the excel file, provide the sheet_name parameter AND the skiprows parameter
## skiprows=4 skips the 4 rows of report formatting above the real header row
## once it works, store it in a dataframe (df2)
df2 = pd.read_excel("raw_data/sba-disaster-loans-18.xlsx",
                    sheet_name="FY18 Business",
                    skiprows=4)
df2

Unnamed: 0,SBA Physical Declaration Number,SBA EIDL Declaration Number,FEMA Disaster Number,SBA Disaster Number,Damaged Property City Name,Damaged Property Zip Code,Damaged Property County/Parish Name,Damaged Property State Code,Total Verified Loss,Verified Loss Real Estate,Verified Loss Content,Total Approved Loan Amount,Approved Amount Real Estate,Approved Amount Content,Approved Amount EIDL
0,,15337.0,,ZZ-00013,LOUISVILLE,40299.0,JEFFERSON,KY,0.0,0.0,0.0,49400.0,0.0,0.0,49400.0
1,,15337.0,,ZZ-00013,NORWOOD YOUNG AMERICA,55368.0,CARVER,MN,0.0,0.0,0.0,49700.0,0.0,0.0,49700.0
2,,15359.0,,FL-00133,ALTAMONTE SPRINGS,32714.0,SEMINOLE,FL,0.0,0.0,0.0,7000.0,0.0,0.0,7000.0
3,,15359.0,,FL-00133,APOPKA,32712.0,ORANGE,FL,0.0,0.0,0.0,25000.0,0.0,0.0,25000.0
4,,15359.0,,FL-00133,AVENTURA,33180.0,MIAMI-DADE,FL,0.0,0.0,0.0,162300.0,0.0,0.0,162300.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
821,15622,15623.0,4382.0,CA-00288,SHASTA,96087.0,Shasta,CA,227915.0,198265.0,29650.0,195100.0,161400.0,29700.0,4000.0
822,15626,15627.0,,NE-00071,PENDER,68047.0,Thurston,NE,7592.0,7592.0,0.0,7600.0,7600.0,0.0,
823,15642,15643.0,,NM-00062,SANTA FE,87505.0,Santa Fe,NM,24472.0,22872.0,1600.0,24500.0,22900.0,1600.0,
824,,,,,,,,,,,,,,,
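
To see what ```skiprows``` does without needing the Excel file, here is a minimal sketch that uses ```pd.read_csv``` on an in-memory CSV instead (```read_csv``` takes the same ```skiprows``` parameter). The report text, column names and numbers are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical export: 3 lines of report formatting, then the real header.
text = (
    "Example report title,,\n"
    "Reporting Period: example,,\n"
    "Declaration Type: ALL,,\n"
    "city,state,amount\n"
    "LOUISVILLE,KY,49400\n"
    "APOPKA,FL,25000\n"
)

# Without skiprows, the first formatting line is mistaken for the header.
raw = pd.read_csv(io.StringIO(text))

# With skiprows=3, the fourth line becomes the header.
clean = pd.read_csv(io.StringIO(text), skiprows=3)

print(list(raw.columns))
print(list(clean.columns))
```

Counting the formatting rows in a raw import first, as we did above, is a quick way to find the right value for ```skiprows```.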


## Congratulations!
You just learned to:

- import different types of spreadsheet files,
- get a quick sense of what our data holds.


These are steps you will take almost every time.

In [20]:
## import flight delay data for 2017 and 2018

df17 = pd.read_csv("raw_data/flt_delays_17.csv")
df18 = pd.read_csv("raw_data/flt_delays_18.csv")
df18.head()

Unnamed: 0,year,month,carrier,carrier_name,airport,airport_name,arr_flights,arr_del15,carrier_ct,weather_ct,...,security_ct,late_aircraft_ct,arr_cancelled,arr_diverted,arr_delay,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
0,2018,1,MQ,Envoy Air,BIS,"Bismarck/Mandan, ND: Bismarck Municipal",5.0,3.0,1.0,0.06,...,0.0,0.0,0.0,0.0,104.0,54.0,1.0,49.0,0.0,0.0
1,2018,1,MQ,Envoy Air,BNA,"Nashville, TN: Nashville International",110.0,21.0,7.17,1.16,...,0.0,5.92,3.0,0.0,897.0,344.0,37.0,226.0,0.0,290.0
2,2018,1,MQ,Envoy Air,BOI,"Boise, ID: Boise Air Terminal",32.0,8.0,0.22,0.35,...,0.0,1.82,0.0,0.0,353.0,9.0,18.0,233.0,0.0,93.0
3,2018,1,MQ,Envoy Air,BPT,"Beaumont/Port Arthur, TX: Jack Brooks Regional",63.0,11.0,1.75,1.08,...,0.0,5.19,3.0,0.0,657.0,83.0,34.0,130.0,0.0,410.0
4,2018,1,MQ,Envoy Air,BUF,"Buffalo, NY: Buffalo Niagara International",31.0,12.0,0.82,3.0,...,0.0,1.55,0.0,0.0,484.0,27.0,136.0,207.0,0.0,114.0


In [21]:
df17.head()

Unnamed: 0,year,month,carrier,carrier_name,airport,airport_name,arr_flights,arr_del15,carrier_ct,weather_ct,...,security_ct,late_aircraft_ct,arr_cancelled,arr_diverted,arr_delay,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
0,2017,3,AA,American Airlines Inc.,ABQ,"Albuquerque, NM: Albuquerque International Sun...",145.0,26.0,10.83,0.0,...,0.0,9.31,0.0,0.0,935.0,437.0,0.0,193.0,0.0,305.0
1,2017,3,AA,American Airlines Inc.,ALB,"Albany, NY: Albany International",89.0,16.0,2.12,0.41,...,0.0,7.26,3.0,0.0,689.0,100.0,40.0,139.0,0.0,410.0
2,2017,3,AA,American Airlines Inc.,ATL,"Atlanta, GA: Hartsfield-Jackson Atlanta Intern...",1205.0,241.0,78.79,4.13,...,0.95,82.14,9.0,3.0,12584.0,3827.0,372.0,2839.0,38.0,5508.0
3,2017,3,AA,American Airlines Inc.,AUS,"Austin, TX: Austin - Bergstrom International",749.0,111.0,36.77,3.42,...,0.05,47.26,2.0,0.0,5112.0,1398.0,219.0,752.0,1.0,2742.0
4,2017,3,AA,American Airlines Inc.,BDL,"Hartford, CT: Bradley International",410.0,102.0,28.99,2.59,...,0.0,25.41,22.0,0.0,3929.0,1185.0,232.0,1183.0,0.0,1329.0


In [22]:
list(df18.columns)

['year',
 ' month',
 'carrier',
 'carrier_name',
 'airport',
 'airport_name',
 'arr_flights',
 'arr_del15',
 'carrier_ct',
 ' weather_ct',
 'nas_ct',
 'security_ct',
 'late_aircraft_ct',
 'arr_cancelled',
 'arr_diverted',
 ' arr_delay',
 ' carrier_delay',
 'weather_delay',
 'nas_delay',
 'security_delay',
 'late_aircraft_delay']

In [23]:
list(df17.columns)

['year',
 ' month',
 'carrier',
 'carrier_name',
 'airport',
 'airport_name',
 'arr_flights',
 'arr_del15',
 'carrier_ct',
 ' weather_ct',
 'nas_ct',
 'security_ct',
 'late_aircraft_ct',
 'arr_cancelled',
 'arr_diverted',
 ' arr_delay',
 ' carrier_delay',
 'weather_delay',
 'nas_delay',
 'security_delay',
 'late_aircraft_delay']

In [24]:
## check that both dfs have the same column names in the same order
list(df18.columns) == list(df17.columns)

True
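
Notice that some column names in the lists above start with a space (```' month'```, ```' arr_delay'```), so ```df18["month"]``` would fail while ```df18[" month"]``` would work. One possible cleanup, sketched here on a made-up dataframe rather than on our real data, is to strip the whitespace from every column name:

```python
import pandas as pd

# Hypothetical frame whose column names mimic the stray leading spaces.
df_demo = pd.DataFrame({"year": [2018], " month": [1], " arr_delay": [104.0]})

# .str.strip() removes leading and trailing whitespace from each name.
df_demo.columns = df_demo.columns.str.strip()

print(list(df_demo.columns))
```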

In [25]:
df17.shape

(8384, 21)

In [26]:
df18.shape

(16791, 21)

In [27]:
type(df18.shape)

tuple

In [28]:
## shape is (rows, columns), so index 1 is the number of columns
df18.shape[1]

21

In [29]:
## index 0 is the number of rows; add both years to get the total row count
df18.shape[0] + df17.shape[0]

25175
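
Since ```shape``` is a plain Python tuple, it can also be unpacked into two variables in one line. A small sketch on a made-up dataframe (the names ```df_demo```, ```n_rows``` and ```n_cols``` are illustrative):

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns.
df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is a (rows, columns) tuple, so it unpacks into two names.
n_rows, n_cols = df_demo.shape

print(n_rows, n_cols)
print(len(df_demo))  # len() of a dataframe is its row count
```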

In [30]:
df17.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8384 entries, 0 to 8383
Data columns (total 21 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   year                 8384 non-null   int64  
 1    month               8384 non-null   int64  
 2   carrier              8384 non-null   object 
 3   carrier_name         8384 non-null   object 
 4   airport              8384 non-null   object 
 5   airport_name         8384 non-null   object 
 6   arr_flights          8377 non-null   float64
 7   arr_del15            8372 non-null   float64
 8   carrier_ct           8377 non-null   float64
 9    weather_ct          8377 non-null   float64
 10  nas_ct               8377 non-null   float64
 11  security_ct          8377 non-null   float64
 12  late_aircraft_ct     8377 non-null   float64
 13  arr_cancelled        8377 non-null   float64
 14  arr_diverted         8377 non-null   float64
 15   arr_delay           8377 non-null   f

In [31]:
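## concat the two dfs (stacks df18's rows under df17's)
## original row labels are kept, so the index restarts at 0 for the df18 rows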
pd.concat([df17, df18])

Unnamed: 0,year,month,carrier,carrier_name,airport,airport_name,arr_flights,arr_del15,carrier_ct,weather_ct,...,security_ct,late_aircraft_ct,arr_cancelled,arr_diverted,arr_delay,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
0,2017,3,AA,American Airlines Inc.,ABQ,"Albuquerque, NM: Albuquerque International Sun...",145.0,26.0,10.83,0.00,...,0.00,9.31,0.0,0.0,935.0,437.0,0.0,193.0,0.0,305.0
1,2017,3,AA,American Airlines Inc.,ALB,"Albany, NY: Albany International",89.0,16.0,2.12,0.41,...,0.00,7.26,3.0,0.0,689.0,100.0,40.0,139.0,0.0,410.0
2,2017,3,AA,American Airlines Inc.,ATL,"Atlanta, GA: Hartsfield-Jackson Atlanta Intern...",1205.0,241.0,78.79,4.13,...,0.95,82.14,9.0,3.0,12584.0,3827.0,372.0,2839.0,38.0,5508.0
3,2017,3,AA,American Airlines Inc.,AUS,"Austin, TX: Austin - Bergstrom International",749.0,111.0,36.77,3.42,...,0.05,47.26,2.0,0.0,5112.0,1398.0,219.0,752.0,1.0,2742.0
4,2017,3,AA,American Airlines Inc.,BDL,"Hartford, CT: Bradley International",410.0,102.0,28.99,2.59,...,0.00,25.41,22.0,0.0,3929.0,1185.0,232.0,1183.0,0.0,1329.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16786,2018,10,YX,Republic Airline,TPA,"Tampa, FL: Tampa International",3.0,0.0,0.00,0.00,...,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16787,2018,10,YX,Republic Airline,TVC,"Traverse City, MI: Cherry Capital",55.0,12.0,2.11,1.93,...,0.00,5.89,0.0,0.0,1108.0,66.0,584.0,81.0,0.0,377.0
16788,2018,10,YX,Republic Airline,TYS,"Knoxville, TN: McGhee Tyson",13.0,2.0,0.00,0.00,...,0.00,1.00,0.0,0.0,200.0,0.0,0.0,26.0,0.0,174.0
16789,2018,10,YX,Republic Airline,VPS,"Valparaiso, FL: Eglin AFB Destin Fort Walton B...",31.0,2.0,0.79,0.00,...,0.00,0.00,1.0,1.0,66.0,33.0,0.0,33.0,0.0,0.0


In [32]:
## ignore_index=True renumbers the rows from 0; store the result in a new df
df_delays = pd.concat([df17, df18], ignore_index=True)
df_delays

Unnamed: 0,year,month,carrier,carrier_name,airport,airport_name,arr_flights,arr_del15,carrier_ct,weather_ct,...,security_ct,late_aircraft_ct,arr_cancelled,arr_diverted,arr_delay,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
0,2017,3,AA,American Airlines Inc.,ABQ,"Albuquerque, NM: Albuquerque International Sun...",145.0,26.0,10.83,0.00,...,0.00,9.31,0.0,0.0,935.0,437.0,0.0,193.0,0.0,305.0
1,2017,3,AA,American Airlines Inc.,ALB,"Albany, NY: Albany International",89.0,16.0,2.12,0.41,...,0.00,7.26,3.0,0.0,689.0,100.0,40.0,139.0,0.0,410.0
2,2017,3,AA,American Airlines Inc.,ATL,"Atlanta, GA: Hartsfield-Jackson Atlanta Intern...",1205.0,241.0,78.79,4.13,...,0.95,82.14,9.0,3.0,12584.0,3827.0,372.0,2839.0,38.0,5508.0
3,2017,3,AA,American Airlines Inc.,AUS,"Austin, TX: Austin - Bergstrom International",749.0,111.0,36.77,3.42,...,0.05,47.26,2.0,0.0,5112.0,1398.0,219.0,752.0,1.0,2742.0
4,2017,3,AA,American Airlines Inc.,BDL,"Hartford, CT: Bradley International",410.0,102.0,28.99,2.59,...,0.00,25.41,22.0,0.0,3929.0,1185.0,232.0,1183.0,0.0,1329.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25170,2018,10,YX,Republic Airline,TPA,"Tampa, FL: Tampa International",3.0,0.0,0.00,0.00,...,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25171,2018,10,YX,Republic Airline,TVC,"Traverse City, MI: Cherry Capital",55.0,12.0,2.11,1.93,...,0.00,5.89,0.0,0.0,1108.0,66.0,584.0,81.0,0.0,377.0
25172,2018,10,YX,Republic Airline,TYS,"Knoxville, TN: McGhee Tyson",13.0,2.0,0.00,0.00,...,0.00,1.00,0.0,0.0,200.0,0.0,0.0,26.0,0.0,174.0
25173,2018,10,YX,Republic Airline,VPS,"Valparaiso, FL: Eglin AFB Destin Fort Walton B...",31.0,2.0,0.79,0.00,...,0.00,0.00,1.0,1.0,66.0,33.0,0.0,33.0,0.0,0.0


In [33]:
## confirm concat did not lose rows
df_delays.shape[0] == df18.shape[0] + df17.shape[0]

True
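
The difference between the two ```concat``` calls is easier to see on a tiny example. This sketch stacks two made-up two-row dataframes (the names and values are illustrative) and compares the resulting row labels:

```python
import pandas as pd

# Hypothetical stand-ins for the 2017 and 2018 files.
df_a = pd.DataFrame({"year": [2017, 2017], "arr_delay": [935.0, 689.0]})
df_b = pd.DataFrame({"year": [2018, 2018], "arr_delay": [104.0, 897.0]})

# Default: each frame keeps its own row labels, so labels repeat.
kept = pd.concat([df_a, df_b])

# ignore_index=True: rows are renumbered from 0 with no repeats.
renumbered = pd.concat([df_a, df_b], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Repeated row labels are not an error, but they can cause surprises later when selecting rows by label, which is why the renumbered version is stored in ```df_delays``` above.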