# Introduction to Data Science - Week 3 - Exercises (Jupyter/Python)

-> https://github.com/rmmariano/CAP386_intro_data_science/blob/master/listas/03/week3/week3.md

Do the necessary imports:

In [60]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [61]:
from os import makedirs
from os.path import exists, isfile, getsize
# import urllib.request as urllib
from urllib.request import urlretrieve

If the temporary directory doesn't exist, create it:

In [62]:
temp_directory = "../TempData/"
if not exists(temp_directory):
    makedirs(temp_directory)
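
As a side note, the check-then-create pattern above can be collapsed into one call. The following is a minimal sketch, assuming Python 3.2 or later; `demo_directory` is a hypothetical scratch path used so that the example does not touch `../TempData/`.

```python
from os import makedirs
from os.path import isdir, join
from tempfile import mkdtemp

# Hypothetical scratch location, used only for this illustration
base = mkdtemp()
demo_directory = join(base, "TempData")

# exist_ok=True makes the call a no-op when the directory already exists
makedirs(demo_directory, exist_ok=True)
makedirs(demo_directory, exist_ok=True)  # the second call does not raise

created = isdir(demo_directory)
```

This avoids the small window between the `exists` test and the `makedirs` call, although for a single-user notebook either form works.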

## First of all: download the CSV and read it into a variable

Download the CSV data from https://data.baltimorecity.gov/dataset/Food-Vendor-Locations/bqw3-z52q/data

In [63]:
vendors = "https://data.baltimorecity.gov/api/views/bqw3-z52q/rows.csv?accessType=DOWNLOAD"

urlretrieve(vendors, "../TempData/BFood.csv")

if isfile("../TempData/BFood.csv"):
    tam = getsize("../TempData/BFood.csv")
    print("File downloaded, ", tam, " bytes.")
else:
    print("Error downloading file!")

File downloaded,  15661  bytes.


Read the CSV keeping the same header:

In [64]:
bVendors = pd.read_csv("../TempData/BFood.csv")

print(bVendors.head())

   Id LicenseNum                                 VendorName  \
0   0   DF000166  Abdul-Ghani, Christina, "The Bullpen Bar"   
1   0   DF000075                                 Ali, Fathi   
2   0   DF000133                                 Ali, Fathi   
3   0   DF000136                                 Ali, Fathi   
4   0   DF000001                                 Ali, Yusuf   

                                          VendorAddr  \
0  508 Washington Blvd, confined within 10 x 10 s...   
1                   SEC Calvert & Madison on Calvert   
2                           NEC Baltimore & Pine Sts   
3                            NEC Light & Redwood Sts   
4  On Hamburg St across from the rear end of the ...   

                                           ItemsSold  \
0        Grilled food, pizza slices, gyro sandwiches   
1    Hot Dogs, Sausage, Snacks, Gum, Candies, Drinks   
2    Hot dogs, Sausage, drinks, snacks, gum, & candy   
3     Hot dogs, sausages, chips, snacks, drinks, gum   
4  L

## R Exercises

* I want to get all variations of "hot dog", including "frank". 
With <tt>case=False</tt>, <tt>str.contains</tt> matches "hot dog" or "frank" regardless of letter case (lower or upper):

In [65]:
bVendors["hotdog"] = bVendors["ItemsSold"].str.contains(u"hot dog|frank", case=False)

bVendors[["ItemsSold", "hotdog"]].head()

Unnamed: 0,ItemsSold,hotdog
0,"Grilled food, pizza slices, gyro sandwiches",False
1,"Hot Dogs, Sausage, Snacks, Gum, Candies, Drinks",True
2,"Hot dogs, Sausage, drinks, snacks, gum, & candy",True
3,"Hot dogs, sausages, chips, snacks, drinks, gum",True
4,"Large & Small beef franks, soft drinks, water,...",True
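
The matching step can be tried on a small series first. This is a minimal sketch: the `items` values are a hypothetical sample, not rows from the Baltimore data, and `na=False` is an addition that is only needed if the column contains missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample, not taken from the Baltimore data
items = pd.Series(["Hot Dogs, Sausage", "Beef franks, water", "Pizza slices", np.nan])

# The pattern is a regular expression: "|" means "either alternative".
# case=False ignores letter case; na=False maps missing values to False.
has_hotdog = items.str.contains("hot dog|frank", case=False, na=False)

result = has_hotdog.tolist()
```

Without `na=False`, a missing `ItemsSold` value would produce a missing value in the result instead of `False`, which matters if the column is later used as a boolean filter.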


* Now I want to get all variations of "pizza":

In [66]:
bVendors["pizza"] = bVendors["ItemsSold"].str.contains(u"pizza", case=False)

bVendors[["ItemsSold", "pizza"]].head()

Unnamed: 0,ItemsSold,pizza
0,"Grilled food, pizza slices, gyro sandwiches",True
1,"Hot Dogs, Sausage, Snacks, Gum, Candies, Drinks",False
2,"Hot dogs, Sausage, drinks, snacks, gum, & candy",False
3,"Hot dogs, sausages, chips, snacks, drinks, gum",False
4,"Large & Small beef franks, soft drinks, water,...",False


* Given a location, we want to extract the name of the town, so first we split the location, getting the part before the <tt>\\n</tt>:

In [84]:
#one_location = "Towson 21204\n(39.28540000000, -76.62260000000)"
one_location = "Owings Mill 21117\n(39.29860000000, -76.61280000000)"

location_vector = one_location.split("\n")

location_vector

['Owings Mill 21117', '(39.29860000000, -76.61280000000)']

Get the name of the town and zip code:

In [85]:
city_and_zip_code = location_vector[0]

city_and_zip_code

'Owings Mill 21117'

To separate the two, first split the string on spaces, which gives a list:

In [86]:
city_and_zip_code_list = city_and_zip_code.split(" ")

city_and_zip_code_list

['Owings', 'Mill', '21117']

Then get the last index of the list:

In [87]:
last_index = len(city_and_zip_code_list) - 1

last_index

2

With this index, now we can separate the name of the city and the zip code, where the zip code is the last position and the name of the town is the rest:

In [88]:
zip_code = city_and_zip_code_list[last_index]

# remove the zip code from the list
del city_and_zip_code_list[last_index]

name_town = city_and_zip_code_list

print(name_town)

print(zip_code)

['Owings', 'Mill']
21117


If the name of the town has more than one word, we need to join the words again:

In [91]:
name_town = ' '.join(name_town)

name_town
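
The split, index, delete, and join steps above can also be written more compactly. This is a minimal sketch, assuming the zip code is always the last space-separated token of the first line; it is an alternative to the step-by-step version, not a replacement for it.

```python
one_location = "Owings Mill 21117\n(39.29860000000, -76.61280000000)"

# The first line holds "town zip"; the second line holds the coordinates
city_and_zip_code = one_location.split("\n")[0]

# rsplit with maxsplit=1 splits only once, starting from the right,
# so a multi-word town name stays in one piece
name_town, zip_code = city_and_zip_code.rsplit(" ", 1)
```

Because the split happens only once, no re-joining of the town name is needed afterwards.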



bVendors.head()


Unnamed: 0,Id,LicenseNum,VendorName,VendorAddr,ItemsSold,Cart_Descr,St,Location 1,hotdog,pizza
0,0,DF000166,"Abdul-Ghani, Christina, ""The Bullpen Bar""","508 Washington Blvd, confined within 10 x 10 s...","Grilled food, pizza slices, gyro sandwiches",Two add'l tables to be added to current 6' tab...,MD,"Towson 21204\n(39.28540000000, -76.62260000000)",False,True
1,0,DF000075,"Ali, Fathi",SEC Calvert & Madison on Calvert,"Hot Dogs, Sausage, Snacks, Gum, Candies, Drinks",Pushcart,MD,"Owings Mill 21117\n(39.29860000000, -76.612800...",True,False
2,0,DF000133,"Ali, Fathi",NEC Baltimore & Pine Sts,"Hot dogs, Sausage, drinks, snacks, gum, & candy",Pushcart,MD,"Owings Mill 21117\n(39.28920000000, -76.626700...",True,False
3,0,DF000136,"Ali, Fathi",NEC Light & Redwood Sts,"Hot dogs, sausages, chips, snacks, drinks, gum",Pushcart,MD,"Owings Mill 21117\n(39.28870000000, -76.613600...",True,False
4,0,DF000001,"Ali, Yusuf",On Hamburg St across from the rear end of the ...,"Large & Small beef franks, soft drinks, water,...",grey pushcart on three wheels,MD,"Baltimore 21239\n(39.27920000000, -76.62200000...",True,False


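Rename the <tt>Location 1</tt> column of <tt>bVendors</tt> to <tt>location</tt>. pandas keeps the space from the CSV header, whereas R's <tt>read.csv</tt> would have converted the name to <tt>Location.1</tt>:

In [92]:
# R equivalent: names(bVendors)[names(bVendors) == "Location.1"] <- "location"

# pandas keeps the CSV header as "Location 1", so that is the key to rename
bVendors = bVendors.rename(columns={"Location 1": "location"})

bVendors.head()

Unnamed: 0,Id,LicenseNum,VendorName,VendorAddr,ItemsSold,Cart_Descr,St,location,hotdog,pizza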
0,0,DF000166,"Abdul-Ghani, Christina, ""The Bullpen Bar""","508 Washington Blvd, confined within 10 x 10 s...","Grilled food, pizza slices, gyro sandwiches",Two add'l tables to be added to current 6' tab...,MD,"Towson 21204\n(39.28540000000, -76.62260000000)",False,True
1,0,DF000075,"Ali, Fathi",SEC Calvert & Madison on Calvert,"Hot Dogs, Sausage, Snacks, Gum, Candies, Drinks",Pushcart,MD,"Owings Mill 21117\n(39.29860000000, -76.612800...",True,False
2,0,DF000133,"Ali, Fathi",NEC Baltimore & Pine Sts,"Hot dogs, Sausage, drinks, snacks, gum, & candy",Pushcart,MD,"Owings Mill 21117\n(39.28920000000, -76.626700...",True,False
3,0,DF000136,"Ali, Fathi",NEC Light & Redwood Sts,"Hot dogs, sausages, chips, snacks, drinks, gum",Pushcart,MD,"Owings Mill 21117\n(39.28870000000, -76.613600...",True,False
4,0,DF000001,"Ali, Yusuf",On Hamburg St across from the rear end of the ...,"Large & Small beef franks, soft drinks, water,...",grey pushcart on three wheels,MD,"Baltimore 21239\n(39.27920000000, -76.62200000...",True,False
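
When renaming columns, it helps to know that `DataFrame.rename` ignores keys that match no column unless told otherwise. The following is a minimal sketch on a hypothetical one-row frame (`df` is not the real `bVendors`), assuming a pandas version that supports the `errors` parameter (0.25 or later).

```python
import pandas as pd

# Hypothetical one-row frame that mimics the column name in the CSV
df = pd.DataFrame({"Location 1": ["Towson 21204\n(39.28540000000, -76.62260000000)"]})

# A key that matches no column is ignored silently by default
unchanged = df.rename(columns={"Location.1": "location"})

# errors="raise" turns the same mismatch into a KeyError
try:
    df.rename(columns={"Location.1": "location"}, errors="raise")
    raised = False
except KeyError:
    raised = True

# With the exact column name, the rename takes effect
renamed = df.rename(columns={"Location 1": "location"}, errors="raise")
```

Passing `errors="raise"` is a cheap way to catch a misspelled column name at the point of the rename rather than several cells later.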


* Now we will do it for the entire dataframe:

In [None]:
# creating the auxiliary vectors
name_town_vector <- vector(length = nrow(bVendors), mode = "character")
zip_code_vector <- vector(length = nrow(bVendors), mode = "character")


for(i in 1:nrow(bVendors)) {
  location_vector = unlist(strsplit(bVendors$location[i], "\n"))
  city_and_zip_code = location_vector[1]
  
  city_and_zip_code_char <- unlist(strsplit(city_and_zip_code, " "))
  city_and_zip_code_list = as.vector(city_and_zip_code_char, mode="list") 
  
  # get the last index in list
  last_index = length(city_and_zip_code_list)
  
  # get the zip code from list
  zip_code = city_and_zip_code_list[last_index]
  
  # zip_code is a list with one string, so we join them in a string
  zip_code = paste(zip_code, collapse = '')
  
  # remove the zip code of the list
  city_and_zip_code_list[last_index] <- NULL
  
  name_town = city_and_zip_code_list
  
  # if the name of town has more than one word, join them
  name_town = paste(name_town, collapse = ' ')
  
  name_town_vector[i] = name_town
  zip_code_vector[i] = zip_code
}
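
The loop above is written in R. A possible pandas counterpart is sketched below. It runs on a hypothetical two-row frame `df` with a `location` column rather than on the real `bVendors`, and it assumes every location has the form "town zip", followed by a newline and the coordinates.

```python
import pandas as pd

# Hypothetical two-row frame with a "location" column shaped like the data above
df = pd.DataFrame({
    "location": [
        "Towson 21204\n(39.28540000000, -76.62260000000)",
        "Owings Mill 21117\n(39.29860000000, -76.61280000000)",
    ]
})

# Keep the text before the newline
city_and_zip = df["location"].str.split("\n").str[0]

# Split once from the right: column 0 is the town, column 1 is the zip code
parts = city_and_zip.str.rsplit(" ", n=1, expand=True)
df["name_town"] = parts[0]
df["zip_code"] = parts[1]
```

The vectorized string methods replace both the explicit loop and the auxiliary vectors, and the new columns are assigned directly to the frame.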

In [None]:
head(name_town_vector)

In [None]:
head(zip_code_vector)

In [None]:
str(bVendors)

Add the vectors with the town names and zip codes to <tt>bVendors</tt> as new columns:

In [None]:
bVendors$name_town <- name_town_vector
bVendors$zip_code <- zip_code_vector

str(bVendors)

In [None]:
head(subset(bVendors, select = c(name_town, zip_code, location)))

It is all OK.