In [1]:
import pandas as pd
import psycopg2

def execute_query(sql_query, dbname='temp', user='postgres', password='postgres', port='5432'):
    """Run sql_query against the PostgreSQL database and return the result as a DataFrame."""
    # Create a connection to the PostgreSQL database
    conn = psycopg2.connect(dbname=dbname, user=user, password=password, port=port)
    try:
        # Use read_sql to execute the query and load the results into a DataFrame
        return pd.read_sql(sql_query, conn)
    finally:
        # Close the database connection, even if the query fails
        conn.close()


# Count the categories

In this chapter, we'll work mostly with the Evanston 311 data in the `evanston311` table. It records help requests submitted to the city of Evanston, IL.

This data has several character columns. Start by examining the most frequent values in some of these columns to get familiar with the common categories.

In [3]:

query_result = execute_query("""
-- Select the count of each level of priority
SELECT priority, COUNT(*)
  FROM evanston311
 GROUP BY priority;                       
""")
query_result

Unnamed: 0,priority,count
0,MEDIUM,5745
1,NONE,30081
2,HIGH,88
3,LOW,517


In [4]:

query_result = execute_query("""
-- Find values of zip that appear in at least 100 rows
-- Also get the count of each value
SELECT zip, COUNT(*)
  FROM evanston311
 GROUP BY zip
HAVING COUNT(*) >= 100;
""")
query_result

Unnamed: 0,zip,count
0,60201.0,19054
1,60202.0,11165
2,60208.0,255
3,,5528


In [5]:

query_result = execute_query("""
-- Find values of source that appear in at least 100 rows
-- Also get the count of each value
SELECT source, COUNT(*)
  FROM evanston311
 GROUP BY source
HAVING COUNT(*) >= 100;
""")
query_result

Unnamed: 0,source,count
0,Android,444
1,gov.publicstuff.com,30985
2,Iframe,3670
3,iOS,1199


In [6]:

query_result = execute_query("""
-- Find the 5 most common values of street and the count of each
SELECT street, COUNT(*)
  FROM evanston311
 GROUP BY street
 ORDER BY count DESC
 LIMIT 5;             
""")
query_result

Unnamed: 0,street,count
0,,1699
1,Chicago Avenue,1440
2,Sherman Avenue,1276
3,Central Street,1211
4,Davis Street,1154
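
For readers more at home in Python, the `GROUP BY` / `HAVING` / `ORDER BY ... LIMIT` pattern used in the queries above can be sketched with `collections.Counter`. The sample values below are hypothetical stand-ins for the `source` column, not rows from the table.

```python
from collections import Counter

# Hypothetical sample values standing in for the source column.
sources = ["iOS", "Android", "iOS", "Iframe", "iOS", "Android", None]

# GROUP BY source with COUNT(*): None (NULL) forms its own group.
counts = Counter(sources)

# HAVING COUNT(*) >= 2: keep only the groups that are frequent enough.
frequent = {value: n for value, n in counts.items() if n >= 2}

# ORDER BY count DESC LIMIT 2: the two most common values.
top_two = counts.most_common(2)

print(frequent)
print(top_two)
```

As in the `street` output above, a missing value is counted as a group of its own rather than dropped.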


# Spotting character data problems

Explore the distinct values of the street column. Select each street value and the count of the number of rows with that value. Sort the results by street to see similar values near each other.

In [11]:

query_result = execute_query("""
-- Select each street value and its count, sorted by street
SELECT street, COUNT(street)
  FROM evanston311
 GROUP BY street
 ORDER BY street;
""")
query_result

Unnamed: 0,street,count
0,1/2 Chicago Ave,1
1,1047B Chicago Ave,1
2,13th Street,1
3,141A Callan Ave,2
4,141b Callan Ave,1
...,...,...
700,Wilmette Avenue,2
701,Woodbine Ave,4
702,Woodbine Avenue,35
703,Woodland Road,18



Scanning the sorted values reveals several problems:

- The street suffix (e.g. Street, Avenue) is sometimes abbreviated
- House/street numbers sometimes appear in the column
- Capitalization is not consistent across values

# Trimming

Some of the street values in evanston311 include house numbers with `#` or `/` in them. In addition, some street values end in a `.`.

Remove the house numbers, extra punctuation, and any spaces from the beginning and end of the street values as a first attempt at cleaning up the values.

In [12]:

query_result = execute_query("""
SELECT distinct street,
       -- Trim off unwanted characters from street
       trim(street, '0123456789 #/.') AS cleaned_street
  FROM evanston311
 ORDER BY street;
""")
query_result

Unnamed: 0,street,cleaned_street
0,1/2 Chicago Ave,Chicago Ave
1,1047B Chicago Ave,B Chicago Ave
2,13th Street,th Street
3,141A Callan Ave,A Callan Ave
4,141b Callan Ave,b Callan Ave
...,...,...
700,Wilmette Avenue,Wilmette Avenue
701,Woodbine Ave,Woodbine Ave
702,Woodbine Avenue,Woodbine Avenue
703,Woodland Road,Woodland Road


# Exploring unstructured text

The description column of evanston311 has the details of the inquiry, while the category column groups inquiries into different types. How well does the `category` capture what's in the `description`?

In [13]:

query_result = execute_query("""
-- Count rows
SELECT COUNT(*)
  FROM evanston311
 -- Where description includes trash or garbage
 WHERE description ILIKE '%trash%'
    OR description ILIKE '%garbage%';
""")
query_result

Unnamed: 0,count
0,2551


In [14]:

query_result = execute_query("""
-- Select categories containing Trash or Garbage
SELECT category
  FROM evanston311
 -- Use LIKE
 WHERE category LIKE '%Trash%'
    OR category LIKE '%Garbage%';
""")
query_result

Unnamed: 0,category
0,THIS REQUEST IS INACTIVE...Trash Cart - Compos...
1,Trash - Tire Pickup
2,Trash - Special Pickup - Resident Use
3,"Trash, Recycling, Yard Waste Cart- Repair/Repl..."
4,"Trash, Recycling, Yard Waste Cart- Repair/Repl..."
...,...
5811,Trash - Missed Garbage Pickup
5812,Trash - Appliance Pickup
5813,Trash - Missed Garbage Pickup
5814,Trash - Appliance Pickup


In [15]:

query_result = execute_query("""
-- Count rows
SELECT COUNT(*)
  FROM evanston311 
 -- description contains trash or garbage (any case)
 WHERE (description ILIKE '%trash%'
    OR description ILIKE '%garbage%') 
 -- category does not contain Trash or Garbage
   AND category NOT LIKE '%Trash%'
   AND category NOT LIKE '%Garbage%';
""")
query_result

Unnamed: 0,count
0,570


In [16]:

query_result = execute_query("""
-- Count rows with each category
SELECT category, COUNT(*)
  FROM evanston311 
 WHERE (description ILIKE '%trash%'
    OR description ILIKE '%garbage%') 
   AND category NOT LIKE '%Trash%'
   AND category NOT LIKE '%Garbage%'
 -- Count rows within each category
 GROUP BY category
 -- Order by most frequent values
 ORDER BY count DESC
 LIMIT 10;
""")
query_result

Unnamed: 0,category,count
0,Ask A Question / Send A Message,273
1,Rodents- Rats,77
2,Recycling - Missed Pickup,28
3,Dead Animal on Public Property,16
4,Graffiti,15
5,Yard Waste - Missed Pickup,14
6,Public Transit Agency Issue,13
7,Food Establishment - Unsanitary Conditions,13
8,Exterior Conditions,10
9,Street Sweeping,9
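
The difference between `LIKE` (case-sensitive) and `ILIKE` (case-insensitive) drives the queries above. As a rough sketch, a simple pattern can be translated to a regular expression in Python. The helper names are hypothetical, the description string is invented, and the sketch ignores `NULL` handling and escape characters.

```python
import re

def like_to_regex(pattern, case_insensitive=False):
    """Translate a simple LIKE pattern: % is any run of characters, _ is one character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    flags = re.DOTALL | (re.IGNORECASE if case_insensitive else 0)
    return re.compile("".join(parts), flags)

def like(value, pattern):
    """Case-sensitive match of the whole value, as with LIKE."""
    return like_to_regex(pattern).fullmatch(value) is not None

def ilike(value, pattern):
    """Case-insensitive match of the whole value, as with ILIKE."""
    return like_to_regex(pattern, case_insensitive=True).fullmatch(value) is not None

description = "Overflowing TRASH can by the alley"  # invented example text
print(ilike(description, "%trash%"))            # True
print(like(description, "%Trash%"))             # False
print(like("Trash - Tire Pickup", "%Trash%"))   # True
```

The pattern must match the whole value, which is why the queries wrap the search term in `%` on both sides.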


# Concatenate strings

House number (`house_num`) and `street` are in two separate columns in evanston311. Concatenate them with a space between the values.

In [17]:

query_result = execute_query("""
-- Concatenate house_num, a space, and street
-- and trim spaces from both ends of the result
SELECT TRIM(CONCAT(house_num,' ', street)) AS address
  FROM evanston311;
""")
query_result

Unnamed: 0,address
0,606-612 Sheridan Road
1,930 Washington St
2,1183-1223 Lincoln St
3,1–111 Callan Ave
4,1524 Crain St
...,...
36426,
36427,
36428,800 Clark Street
36429,


# Split strings on a delimiter

The street suffix is the part of the street name that gives the type of street, such as Avenue, Road, or Street. In the Evanston 311 data, the suffix of a `street` value is sometimes the full word and sometimes an abbreviation. Grouping by the first word of `street` counts each street name regardless of which suffix form follows it.

In [18]:

query_result = execute_query("""
-- Select the first word of the street value
SELECT SPLIT_PART(street, ' ', 1) AS street_name, 
       count(*)
  FROM evanston311
 GROUP BY street_name
 ORDER BY count DESC
 LIMIT 20;
""")
query_result

Unnamed: 0,street_name,count
0,,1699
1,Chicago,1569
2,Central,1529
3,Sherman,1479
4,Davis,1248
5,Church,1225
6,Main,880
7,Sheridan,842
8,Ridge,823
9,Dodge,816
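
To make the behavior of `SPLIT_PART` concrete, here is a minimal Python sketch for positive field numbers. The function is an illustration written for this note, not a library call; the street values come from earlier outputs.

```python
def split_part(value, delimiter, n):
    """Mimic PostgreSQL SPLIT_PART for positive n: 1-based field, '' if out of range."""
    if value is None:  # SQL NULL stays NULL
        return None
    fields = value.split(delimiter)
    return fields[n - 1] if 1 <= n <= len(fields) else ""

print(split_part("Chicago Avenue", " ", 1))     # Chicago
print(split_part("Chicago Avenue", " ", 2))     # Avenue
print(repr(split_part("Chicago Ave", " ", 3)))  # ''
print(split_part("1047B Chicago Ave", " ", 1))  # 1047B
```

The last call shows why house numbers left in the column interfere with grouping by the first word.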


# Shorten long strings

The description column of evanston311 can be very long. You can get the length of a string with the length() function.

For displaying or quickly reviewing the data, you might want to only display the first few characters. You can use the left() function to get a specified number of characters at the start of each value.

In [19]:

query_result = execute_query("""
-- Select the first 50 chars when length is greater than 50
SELECT CASE WHEN length(description) > 50
            THEN LEFT(description, 50) || '...'
       -- otherwise just select description
       ELSE description
       END
  FROM evanston311
 -- limit to descriptions that start with the word I
 WHERE description LIKE 'I %'
 ORDER BY description;
""")
query_result

Unnamed: 0,description
0,I work for Schermerhorn & Co. and manage this...
1,I accidentally mistyped my license plate numbe...
2,I accidentally sent the wrong cover letter on ...
3,I acquired c diff at north shore hospital in E...
4,I am a 35 year resident of Evanston (314 Custe...
...,...
390,I would like to see my water charges for the l...
391,I would like to speak with someone from the fi...
392,I would like to start new water service at thi...
393,I would love to come on Thursday June 1st anyt...
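
The `CASE` expression above amounts to a small truncation rule. A Python sketch of the same logic (the function name `shorten` is chosen for this example):

```python
def shorten(text, width=50):
    """Keep the first `width` characters and add '...' only when text is longer."""
    if text is None:  # SQL NULL stays NULL
        return None
    return text[:width] + "..." if len(text) > width else text

print(shorten("I would like to start new water service", width=20))
print(shorten("Short text"))
```

A value of exactly 50 characters is returned unchanged, matching the `> 50` comparison in the query.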


# Create an "other" category

If we want to summarize Evanston 311 requests by zip code, it would be useful to group all of the low frequency zip codes together in an "other" category.

In [21]:

query_result = execute_query("""
SELECT zip, count(*) AS zipcount
  FROM evanston311
 GROUP BY zip
 ORDER BY zipcount DESC;
""")
query_result

Unnamed: 0,zip,zipcount
0,60201,19054
1,60202,11165
2,,5528
3,60208,255
4,60091,89
...,...,...
117,61111,1
118,46383,1
119,76645,1
120,60560,1


In [22]:

query_result = execute_query("""
SELECT CASE WHEN zipcount < 100 THEN 'other'
       ELSE zip
       END AS zip_recoded,
       sum(zipcount) AS zipsum
  FROM (SELECT zip, count(*) AS zipcount
          FROM evanston311
         GROUP BY zip) AS fullcounts
 GROUP BY zip_recoded
 ORDER BY zipsum DESC;
""")
query_result

Unnamed: 0,zip_recoded,zipsum
0,60201,19054.0
1,60202,11165.0
2,,5528.0
3,other,429.0
4,60208,255.0
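
The recoding step can be sketched in Python as a second aggregation over the per-zip counts. The counts below are copied from the visible rows of the earlier output only; the elided rows are left out, so the `other` total here is smaller than the 429 returned by the query.

```python
from collections import defaultdict

# Counts from the visible rows of the earlier output; elided rows are omitted.
zip_counts = {
    "60201": 19054, "60202": 11165, None: 5528, "60208": 255,
    "60091": 89, "61111": 1, "46383": 1, "76645": 1, "60560": 1,
}

# CASE WHEN zipcount < 100 THEN 'other' ELSE zip END, then SUM(zipcount).
recoded = defaultdict(int)
for zip_code, n in zip_counts.items():
    key = "other" if n < 100 else zip_code
    recoded[key] += n

# ORDER BY zipsum DESC
result = sorted(recoded.items(), key=lambda item: item[1], reverse=True)
print(result)
```

The missing zip (`None`) keeps its own group because its count is above the threshold.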


# Group and recode values

There are almost 150 distinct values of evanston311.category. But some of these categories are similar, with the form "Main Category - Details". We can get a better sense of what requests are common if we aggregate by the main category.

In [23]:

query_result = execute_query("""
-- Drop the temp table if it already exists
DROP TABLE IF EXISTS recode;

-- Create and name the temporary table
CREATE TEMP TABLE recode AS
-- Write the select query to generate the table 
-- with distinct values of category and standardized values
  SELECT DISTINCT category, 
         RTRIM(SPLIT_PART(category, '-', 1)) AS standardized
    -- Source table for the category values
    FROM evanston311;
    
-- Look at a few values before the next step
SELECT DISTINCT standardized 
  FROM recode
 WHERE standardized LIKE 'Trash%Cart'
    OR standardized LIKE 'Snow%Removal%';
""")
query_result

Unnamed: 0,standardized
0,Snow Removal
1,Snow Removal/Concerns
2,Snow/Ice/Hazard Removal
3,Trash Cart
4,"Trash Cart, Recycling Cart"
5,"Trash, Recycling, Yard Waste Cart"


In [24]:

query_result = execute_query("""
-- Code from previous step
DROP TABLE IF EXISTS recode;

CREATE TEMP TABLE recode AS
  SELECT DISTINCT category, 
         rtrim(split_part(category, '-', 1)) AS standardized
    FROM evanston311;

-- Update to group trash cart values
UPDATE recode 
   SET standardized='Trash Cart' 
 WHERE standardized LIKE 'Trash%Cart';

-- Update to group snow removal values
UPDATE recode 
   SET standardized='Snow Removal' 
  WHERE standardized LIKE 'Snow%Removal%';
    
-- Examine effect of updates
SELECT DISTINCT standardized 
  FROM recode
 WHERE standardized LIKE 'Trash%Cart'
    OR standardized LIKE 'Snow%Removal%';
""")
query_result

Unnamed: 0,standardized
0,Snow Removal
1,Trash Cart
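
The two recoding steps, taking the text before the first hyphen and then collapsing similar values, can be sketched in Python as follows. The function names are hypothetical, and the `LIKE` patterns are rewritten as equivalent string checks. The inputs are category and standardized values that appear in earlier outputs.

```python
def main_category(category):
    """Mimic rtrim(split_part(category, '-', 1)): text before the first hyphen."""
    return category.split("-")[0].rstrip(" ")

def regroup(standardized):
    """Apply the two UPDATE statements as string checks."""
    # LIKE 'Trash%Cart': starts with Trash and ends with Cart
    if standardized.startswith("Trash") and standardized.endswith("Cart"):
        return "Trash Cart"
    # LIKE 'Snow%Removal%': starts with Snow, with Removal somewhere after it
    if standardized.startswith("Snow") and "Removal" in standardized[4:]:
        return "Snow Removal"
    return standardized

print(main_category("Rodents- Rats"))                # Rodents
print(main_category("Trash - Tire Pickup"))          # Trash
print(regroup("Trash, Recycling, Yard Waste Cart"))  # Trash Cart
print(regroup("Snow/Ice/Hazard Removal"))            # Snow Removal
```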


In [25]:

query_result = execute_query("""
-- Code from previous step
DROP TABLE IF EXISTS recode;

CREATE TEMP TABLE recode AS
  SELECT DISTINCT category, 
         rtrim(split_part(category, '-', 1)) AS standardized
    FROM evanston311;
  
UPDATE recode SET standardized='Trash Cart' 
 WHERE standardized LIKE 'Trash%Cart';

UPDATE recode SET standardized='Snow Removal' 
 WHERE standardized LIKE 'Snow%Removal%';

-- Update to group unused/inactive values
UPDATE recode 
   SET standardized='UNUSED' 
 WHERE standardized IN ('THIS REQUEST IS INACTIVE...Trash Cart', 
               '(DO NOT USE) Water Bill',
               'DO NOT USE Trash', 
               'NO LONGER IN USE');

-- Examine effect of updates
SELECT DISTINCT standardized 
  FROM recode
 ORDER BY standardized;
""")
query_result

Unnamed: 0,standardized
0,Abandoned Bicycle on City Property
1,Abandoned Vehicle
2,Accessibility
3,ADA/Inclusion Aids
4,Advanced Disposal
...,...
110,Water Plant Tour
111,Water Quality
112,Water Service
113,Water Service Disruption


In [26]:

query_result = execute_query("""
-- Code from previous step
DROP TABLE IF EXISTS recode;
CREATE TEMP TABLE recode AS
  SELECT DISTINCT category, 
         rtrim(split_part(category, '-', 1)) AS standardized
  FROM evanston311;
UPDATE recode SET standardized='Trash Cart' 
 WHERE standardized LIKE 'Trash%Cart';
UPDATE recode SET standardized='Snow Removal' 
 WHERE standardized LIKE 'Snow%Removal%';
UPDATE recode SET standardized='UNUSED' 
 WHERE standardized IN ('THIS REQUEST IS INACTIVE...Trash Cart', 
               '(DO NOT USE) Water Bill',
               'DO NOT USE Trash', 'NO LONGER IN USE');

-- Select the recoded categories and the count of each
SELECT standardized, COUNT(*)
-- From the original table and table with recoded values
  FROM evanston311 
       INNER JOIN recode 
       -- Join on the column the two tables have in common
       ON evanston311.category=recode.category 
 -- Group by the recoded category to count rows in each
 GROUP BY standardized
 -- Display the most common values first
 ORDER BY count DESC;
""")
query_result

Unnamed: 0,standardized,count
0,Broken Parking Meter,6092
1,Trash,3699
2,Ask A Question / Send A Message,2595
3,Trash Cart,1902
4,Tree Evaluation,1879
...,...,...
110,General Payment Question,1
111,Line Down,1
112,Illegally Placed Newsrack,1
113,Key Request,1


# Create a table with indicator variables

Determine whether medium and high priority requests in the evanston311 data are more likely to contain requesters' contact information: an email address or phone number.

- Emails contain an `@`.
- Phone numbers have the pattern of three characters, dash, three characters, dash, four characters. For example: `555-555-1212`.

In [27]:

query_result = execute_query("""
-- To clear table if it already exists
DROP TABLE IF EXISTS indicators;

-- Create the indicators temp table
CREATE TEMP TABLE indicators AS
  -- Select id
  SELECT id, 
         -- Create the email indicator (find @)
         CAST (description LIKE '%@%' AS integer) AS email,
         -- Create the phone indicator
         CAST (description LIKE '%___-___-____%' AS integer) AS phone 
    -- Source table for the descriptions
    FROM evanston311;

-- Inspect the contents of the new temp table
SELECT *
  FROM indicators;
""")
query_result

Unnamed: 0,id,email,phone
0,1340563,0.0,0.0
1,1826017,0.0,0.0
2,1849204,0.0,0.0
3,1880254,0.0,0.0
4,1972582,0.0,1.0
...,...,...,...
36426,3693675,,
36427,3725724,,
36428,3748787,,
36429,3806545,,
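
One caveat about the phone indicator: in a `LIKE` pattern, `_` matches any single character, not just a digit, so `'%___-___-____%'` also flags hyphenated text that is not a phone number. The Python sketch below contrasts that loose pattern with a digits-only regular expression. The stricter pattern is an assumption added for comparison and is not part of the original query; the sample strings are invented.

```python
import re

# Equivalent of LIKE '%___-___-____%': each _ is any single character.
loose_phone = re.compile(r".{3}-.{3}-.{4}", re.DOTALL)

# A stricter alternative (an assumption, not in the original query): digits only.
strict_phone = re.compile(r"\d{3}-\d{3}-\d{4}")

def indicator(pattern, text):
    """Return 1 or 0 like CAST(... AS integer); NULL input gives NULL (None)."""
    if text is None:
        return None
    return int(pattern.search(text) is not None)

for text in ["Call me at 555-555-1212", "one-two-four", None]:
    print(repr(text), indicator(loose_phone, text), indicator(strict_phone, text))
```

The `None` case mirrors the empty indicator values at the end of the output above, which is what a `NULL` description would produce.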


In [28]:

query_result = execute_query("""
-- To clear table if it already exists
DROP TABLE IF EXISTS indicators;

-- Create the temp table
CREATE TEMP TABLE indicators AS
  SELECT id, 
         CAST (description LIKE '%@%' AS integer) AS email,
         CAST (description LIKE '%___-___-____%' AS integer) AS phone 
    FROM evanston311;
  
-- Select the column you'll group by
SELECT priority,
       -- Compute the proportion of rows with each indicator
       SUM(email)/COUNT(*)::NUMERIC AS email_prop, 
       SUM(phone)/COUNT(*)::NUMERIC AS phone_prop
  -- Join the original table to the indicators
  FROM evanston311
       LEFT JOIN indicators
       -- Match rows on the request id
       ON evanston311.id = indicators.id
 -- One row of proportions per priority level
 GROUP BY priority;
""")
query_result

Unnamed: 0,priority,email_prop,phone_prop
0,MEDIUM,0.019669,0.018451
1,NONE,0.004122,0.005685
2,HIGH,0.011364,0.022727
3,LOW,0.005803,0.001934
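
Two details of `SUM(email)/COUNT(*)::NUMERIC` are easy to miss: `SUM` skips `NULL` indicators while `COUNT(*)` counts every row, and the cast to `NUMERIC` avoids integer division. A small Python sketch with invented indicator values (the function name `proportion` is made up):

```python
def proportion(flags):
    """SUM(flag) / COUNT(*): None values are skipped in the sum but counted as rows."""
    non_null = [f for f in flags if f is not None]
    if not non_null:  # SUM over only NULLs is NULL
        return None
    return sum(non_null) / len(flags)

flags = [1, 0, None, 0]  # invented indicator values for one priority group

print(proportion(flags))                                   # 0.25
print(sum(f for f in flags if f is not None) // len(flags))  # 0, integer division
```

Without the cast, PostgreSQL would divide one integer by another and truncate every proportion below 1 to 0, as the second line illustrates.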
