As part of this analysis, we are going to examine per capita liquor consumption in Iowa. To do that, we need population data, which we take from the [County Population in Iowa by Year](https://data.iowa.gov/Community-Demographics/County-Population-in-Iowa-by-Year/qtnr-zsrc/explore/query/SELECT%0A%20%20%60fips%60%2C%0A%20%20%60geographicname%60%2C%0A%20%20%60year%60%2C%0A%20%20%60population%60%2C%0A%20%20%60primary_point%60%2C%0A%20%20%60%3A%40computed_region_hhz5_dst4%60%2C%0A%20%20%60%3A%40computed_region_y683_txed%60%2C%0A%20%20%60%3A%40computed_region_g8ff_h7ce%60/page/filter) dataset. Once downloaded, the CSV file is placed in the working directory.


In [1]:
import pandas as pd

In [2]:
#reading in county population data
county_pop = pd.read_csv('County_Population_in_Iowa_by_Year.csv')

Below is a quick overview of the columns in this dataset, followed by the dataframe info. From this dataset, we are primarily interested in the 'County', 'Year' and 'Population' columns.

In [3]:
county_pop.head()

Unnamed: 0,FIPS,County,Year,Population,Primary Point,Rating Areas for Iowa Individual Affordable Care Act Premiums,County Boundaries of Iowa,Iowa Regional Zip Codes
0,19043,Clayton County,July 01 2012,17946,POINT (-91.3414328 42.8447493),6,20,
1,19171,Tama County,July 01 2012,17501,POINT (-92.5325425 42.0798117),1,47,
2,19145,Page County,July 01 2012,15702,POINT (-95.1501747 40.7391444),4,91,
3,19065,Fayette County,July 01 2012,20774,POINT (-91.8443207 42.8625919),7,21,
4,19107,Keokuk County,July 01 2012,10427,POINT (-92.1786426 41.336465),5,73,


In [4]:
county_pop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1089 entries, 0 to 1088
Data columns (total 8 columns):
 #   Column                                                         Non-Null Count  Dtype  
---  ------                                                         --------------  -----  
 0   FIPS                                                           1089 non-null   int64  
 1   County                                                         1089 non-null   object 
 2   Year                                                           1089 non-null   object 
 3   Population                                                     1089 non-null   int64  
 4   Primary Point                                                  1089 non-null   object 
 5   Rating Areas for Iowa Individual Affordable Care Act Premiums  1089 non-null   int64  
 6   County Boundaries of Iowa                                      1089 non-null   int64  
 7   Iowa Regional Zip Codes                                      

Unfortunately, this dataset does not contain the Iowa county numbers, which are useful as a key when merging datasets, so we need to extract them. To do this, I've written a simple function that extracts the data from this [pdf](https://tax.iowa.gov/sites/default/files/2020-07/Iowa%20County%20Names%20and%20Numbers.pdf) published on tax.iowa.gov.

In [5]:
from PyPDF2 import PdfReader

def extract_data_from_pdf(pdf_path):
    #create pdfreader object
    pdf = PdfReader(pdf_path)

    #lists to store county names and numbers
    county_numbers = []
    county_names = []

    #main block to extract text from the first page of the pdf
    text = pdf.pages[0].extract_text()
    lines = text.split('\n') #each line contains new county info
    for line in lines:
        if '-' in line and '76-002' not in line:  # exclude line with '76-002'
            county_number, county_name = line.split('-', 1) # split line on hyphen
            county_numbers.append(int(county_number.strip()))  # convert county number to int
            county_names.append(county_name.strip().upper())  # convert county name to uppercase
        elif 'O’BRIEN' in line: #special handling of Obrien county 
            parts = line.split('\t')
            county_numbers.append(int(parts[0].strip()))  # keep the column integer-typed for isin() and sorting
            county_names.append(parts[1].strip().upper())

    #df to store county_numbers and names
    df = pd.DataFrame({
        'County Number': county_numbers,
        'County': county_names
    })

    # Most data was easily extracted, but a handful of county numbers had incorrect/mismatched
    # data, so I decided to simply remove those rows and manually add them back in. Given more time,
    # a more elegant solution would have been developed.

    #delete rows where 'County Number' equals 33, 66, and 71
    df = df[~df['County Number'].isin([33, 66, 71])]

    #manually set corrected entries for 33, 34, 66, 67, and 71 in list of dictionaries
    manual_entries = [
        {'County Number': 33, 'County': 'FAYETTE'},
        {'County Number': 34, 'County': 'FLOYD'},
        {'County Number': 66, 'County': 'MITCHELL'},
        {'County Number': 67, 'County': 'MONONA'},
        {'County Number': 71, 'County': "O'BRIEN"}
    ]

    #create df from manual entries
    manual_entries_df = pd.DataFrame(manual_entries)

    #concatenate df from original sweep with manual entry data
    df = pd.concat([df, manual_entries_df], ignore_index=True)

    #sort the df based on county number column in ascending order
    df = df.sort_values('County Number', ascending=True)

    return df


In [6]:
#file path
path = 'Iowa County Names and Numbers.pdf'
#instantiate df containing county data
county_df = extract_data_from_pdf(path)

After running that, we can see that we've successfully extracted the county names and numbers from the pdf.

In [7]:
county_df.head()

Unnamed: 0,County Number,County
0,1,ADAIR
1,2,ADAMS
2,3,ALLAMAKEE
3,4,APPANOOSE
4,5,AUDUBON
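
The extraction function above notes that a more elegant solution was left for later. As a rough sketch of one direction, the line parsing could be done with a regular expression that only accepts a one- or two-digit number, a hyphen, and a name beginning with a letter. The sample lines below are hypothetical stand-ins; the real text layout of the pdf may differ, so this is an illustration rather than a drop-in replacement.

```python
import re

# Assumed line shape: 1-2 digit county number, a hyphen, then a name that starts with a letter.
# Requiring a letter after the hyphen rejects form-number lines such as '76-002'.
LINE_PATTERN = re.compile(r"^\s*(\d{1,2})\s*-\s*([A-Za-z].*?)\s*$")

def parse_county_line(line):
    """Return (number, NAME) for a line like '01 - Adair', or None if it does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    number, name = match.groups()
    # Replace the typographic apostrophe so O’BRIEN becomes O'BRIEN.
    return int(number), name.upper().replace("\u2019", "'")

# Hypothetical sample lines, not copied from the pdf.
sample_lines = ["01 - Adair", "71 - O\u2019Brien", "76-002 (07/2020)", "County Names"]
parsed = [p for p in map(parse_county_line, sample_lines) if p is not None]
print(parsed)
```

Lines that do not match are returned as `None` and skipped, which makes it easy to log them and inspect what the pdf extraction actually produced.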


Returning to the County Population dataset, we can see that the Year column isn't in a format suitable for a DATE column in a SQL database. Let's change that.

In [8]:
county_pop.head()

Unnamed: 0,FIPS,County,Year,Population,Primary Point,Rating Areas for Iowa Individual Affordable Care Act Premiums,County Boundaries of Iowa,Iowa Regional Zip Codes
0,19043,Clayton County,July 01 2012,17946,POINT (-91.3414328 42.8447493),6,20,
1,19171,Tama County,July 01 2012,17501,POINT (-92.5325425 42.0798117),1,47,
2,19145,Page County,July 01 2012,15702,POINT (-95.1501747 40.7391444),4,91,
3,19065,Fayette County,July 01 2012,20774,POINT (-91.8443207 42.8625919),7,21,
4,19107,Keokuk County,July 01 2012,10427,POINT (-92.1786426 41.336465),5,73,


In [9]:
#converting Year column into correct date format
county_pop['Year'] = pd.to_datetime(county_pop['Year'], errors='coerce')
county_pop['Year'] = county_pop['Year'].dt.strftime('%Y-%m-%d')

In [10]:
county_pop.head()

Unnamed: 0,FIPS,County,Year,Population,Primary Point,Rating Areas for Iowa Individual Affordable Care Act Premiums,County Boundaries of Iowa,Iowa Regional Zip Codes
0,19043,Clayton County,2012-07-01,17946,POINT (-91.3414328 42.8447493),6,20,
1,19171,Tama County,2012-07-01,17501,POINT (-92.5325425 42.0798117),1,47,
2,19145,Page County,2012-07-01,15702,POINT (-95.1501747 40.7391444),4,91,
3,19065,Fayette County,2012-07-01,20774,POINT (-91.8443207 42.8625919),7,21,
4,19107,Keokuk County,2012-07-01,10427,POINT (-92.1786426 41.336465),5,73,


With the dates in ISO format, we can now drop the columns we don't need, remove the word 'County' from the County column, and convert the county names to upper case.

In [11]:
county_pop = county_pop.drop(columns=['FIPS', 'Iowa Regional Zip Codes', 'Rating Areas for Iowa Individual Affordable Care Act Premiums', 'County Boundaries of Iowa'])
county_pop['County'] = county_pop['County'].str.replace(' County', '', regex=False)
county_pop['County'] = county_pop['County'].str.upper()


In [12]:
county_pop.head()

Unnamed: 0,County,Year,Population,Primary Point
0,CLAYTON,2012-07-01,17946,POINT (-91.3414328 42.8447493)
1,TAMA,2012-07-01,17501,POINT (-92.5325425 42.0798117)
2,PAGE,2012-07-01,15702,POINT (-95.1501747 40.7391444)
3,FAYETTE,2012-07-01,20774,POINT (-91.8443207 42.8625919)
4,KEOKUK,2012-07-01,10427,POINT (-92.1786426 41.336465)
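
The `str.replace` call above removes ' County' wherever it appears in the string. That works for this dataset, but a slightly more defensive sketch strips the word only when it is a trailing suffix. The function name `normalize_county` is a hypothetical helper introduced here for illustration.

```python
import re

# Match the word 'County' only at the very end of the name.
_COUNTY_SUFFIX = re.compile(r"\s+County$", flags=re.IGNORECASE)

def normalize_county(name):
    """Strip a trailing ' County' and upper-case the result."""
    return _COUNTY_SUFFIX.sub("", name.strip()).upper()

print(normalize_county("Clayton County"))
```

In pandas the same idea can be written as `county_pop['County'].str.replace(r'\s+County$', '', regex=True).str.upper()`.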


Once that's done, we can merge the two datasets on county name to map each County Number to its population data. The result is then saved to a csv file.

In [13]:

merged_county_data = pd.merge(county_pop, county_df[['County', 'County Number']], on='County', how='left')

In [14]:
merged_county_data.head()

Unnamed: 0,County,Year,Population,Primary Point,County Number
0,CLAYTON,2012-07-01,17946,POINT (-91.3414328 42.8447493),22
1,TAMA,2012-07-01,17501,POINT (-92.5325425 42.0798117),86
2,PAGE,2012-07-01,15702,POINT (-95.1501747 40.7391444),73
3,FAYETTE,2012-07-01,20774,POINT (-91.8443207 42.8625919),33
4,KEOKUK,2012-07-01,10427,POINT (-92.1786426 41.336465),54


In [15]:
len(merged_county_data)

1089

In [16]:
merged_county_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1089 entries, 0 to 1088
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   County         1089 non-null   object
 1   Year           1089 non-null   object
 2   Population     1089 non-null   int64 
 3   Primary Point  1089 non-null   object
 4   County Number  1089 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 42.7+ KB
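
The `info()` output above shows 1089 non-null County Number values, so every population row found a match. Because a left merge on names fails quietly when spellings differ, it can help to check for unmatched names before merging. The sketch below uses a made-up helper, `unmatched_names`, and hypothetical name lists in which one entry differs only by its apostrophe character.

```python
def unmatched_names(left_names, right_names):
    """Return the sorted names present in left_names but missing from right_names."""
    return sorted(set(left_names) - set(right_names))

# Hypothetical lists: the lookup side uses a typographic apostrophe.
population_names = ["ADAIR", "O'BRIEN", "TAMA"]
lookup_names = ["ADAIR", "O\u2019BRIEN", "TAMA"]

missing = unmatched_names(population_names, lookup_names)
print(missing)
```

With the dataframes from this notebook, the call would be `unmatched_names(county_pop['County'], county_df['County'])`; an empty list means every county has a number.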


In [None]:
merged_county_data.to_csv('county_pop_data.csv')

The final step is to send that data to a MySQL database for storage.

In [21]:
import mysql.connector

#connect to mysql server
connection = mysql.connector.connect(
    host="host",
    user="root",
    password="not_my_pass",
    database="mysql_database"
)

#cursor to execute SQL commands
cursor = connection.cursor()


create_table_query = """
    CREATE TABLE IF NOT EXISTS county_population_data (
        County VARCHAR(255),
        Date DATE,
        Population INT,
        Primary_Point VARCHAR(255),
        County_Number INT
    )
"""
cursor.execute(create_table_query)

#convert df to list of tuples
row_data = [tuple(row) for row in merged_county_data.values]

#insertion query
insert_query = """
    INSERT INTO county_population_data (County, Date, Population, Primary_Point, County_Number)
    VALUES (%s, %s, %s, %s, %s)
"""

cursor.executemany(insert_query, row_data)

connection.commit()
cursor.close()
connection.close()
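
To try the create, insert and commit pattern without a running MySQL server, here is a minimal sketch that swaps in Python's built-in `sqlite3` module as a stand-in. It is not the MySQL code used above: `sqlite3` uses `?` placeholders where `mysql.connector` uses `%s`, and the two rows are copied from the merged output shown earlier.

```python
import sqlite3

# Two rows taken from the merged dataframe output above.
rows = [
    ("CLAYTON", "2012-07-01", 17946, "POINT (-91.3414328 42.8447493)", 22),
    ("TAMA", "2012-07-01", 17501, "POINT (-92.5325425 42.0798117)", 86),
]

# In-memory database as a stand-in for the MySQL server.
connection = sqlite3.connect(":memory:")
try:
    cursor = connection.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS county_population_data (
            County VARCHAR(255),
            Date DATE,
            Population INT,
            Primary_Point VARCHAR(255),
            County_Number INT
        )
    """)
    # Parameterized insert: values are passed separately from the SQL text.
    cursor.executemany(
        "INSERT INTO county_population_data VALUES (?, ?, ?, ?, ?)", rows
    )
    connection.commit()
    # Read the row count back to confirm the insert.
    cursor.execute("SELECT COUNT(*) FROM county_population_data")
    inserted = cursor.fetchone()[0]
finally:
    connection.close()

print(inserted)
```

Wrapping the work in `try`/`finally` ensures the connection is closed even if one of the statements fails, and the same structure carries over to the MySQL connection.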
