# Lecture 3 - Part 1 : String (Text) Operations<a name="_string (text) operations"></a>

In the last lecture you were introduced to the Pandas library and learned how to read and write files, select, describe, reorganize and transform data, count values, and filter rows.

What you learned is the first step of data analysis: data assessment, along with a bit of data cleaning such as dropping data and dealing with NA values.

Cleaning the data can include cleaning missing values but also cleaning string data. 
Cleaning string data includes a set of operations such as string replacement, deletion, transformation, splitting, etc. 

Luckily, Pandas provides the `.str` accessor, whose functions cover many of these cleaning operations.

You can find documentation here: http://pandas.pydata.org/pandas-docs/stable/text.html



## Fixing Case on String Columns

To fix the case of a string column:
- `.lower()` transforms the values in a column to lower case
- `.upper()` transforms the values to upper case
- `.title()` transforms the values to title case: the first letter of each word in upper case and the rest in lower case.

You should always use `.str` before the function like this: `expression.str.function()`.

`expression` can be any code that accesses a column, for example: `data['column'].str.function()`.
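
As a minimal sketch of the three case functions, the snippet below applies them to a small made-up Series (the values are hypothetical, not read from any file):

```python
import pandas as pd

# Hypothetical example values with inconsistent case
names = pd.Series(["THEFT", "criminal damage", "Other Offense"])

lower = names.str.lower()   # every letter in lower case
upper = names.str.upper()   # every letter in upper case
title = names.str.title()   # first letter of each word in upper case

print(title.tolist())
```

Each call returns a new Series; the original `names` is left unchanged unless you assign the result back.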



## Search for String Values

To search for rows where a string "word" is contained in part of a specific column:
- `data['column'].str.contains("word")` 
This gives you a boolean Series of `True` and `False` values, where `True` means that the value contains the word and `False` that it does not.

Other str search functions:
- `str.match` : checks whether each value in a column matches a pattern, starting from the beginning of the string
- `str.contains`
- `str.startswith`
- `str.endswith`
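
The four search functions can be compared side by side. This is only a sketch on a made-up Series; `na=False` is used so that the missing value counts as "no match" instead of producing a missing result:

```python
import pandas as pd

# Hypothetical descriptions, including one missing value
desc = pd.Series(["retail theft", "theft of labor/services", "over $300", None])

contains = desc.str.contains("theft", na=False)   # "theft" anywhere in the value
starts = desc.str.startswith("theft", na=False)   # value begins with "theft"
ends = desc.str.endswith("theft", na=False)       # value ends with "theft"
matches = desc.str.match("theft", na=False)       # pattern matches at the beginning

print(pd.DataFrame({"contains": contains, "starts": starts,
                    "ends": ends, "match": matches}))
```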

You can also use **regular expressions** in searches. A regular expression, regex or regexp (sometimes called a rational expression) is a sequence of characters that defines a search pattern.

Regular expressions offer a more flexible search. Example:
`b[aeiou]t` : Matches "bat", "bet", "bit", "bot" and "but". When used as a search it also finds these inside longer strings such as "cricket bat" and "bitter lemon".
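
A quick way to try the pattern `b[aeiou]t` in pandas is sketched below, on made-up phrases. `str.contains` treats its argument as a regular expression by default and looks for it anywhere in the value, while `str.fullmatch` requires the whole value to match:

```python
import pandas as pd

# Hypothetical phrases to search
phrases = pd.Series(["bat", "cricket bat", "bitter lemon", "boot", "beat"])

found = phrases.str.contains(r"b[aeiou]t")    # pattern found anywhere in the value
whole = phrases.str.fullmatch(r"b[aeiou]t")   # the entire value must match

print(phrases[found].tolist())
```

"boot" and "beat" are not matched because the pattern allows exactly one vowel between "b" and "t".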

We won't go through them in this lecture. But you can read more about Regular Expressions here: https://docs.python.org/3/howto/regex.html

And some examples in pandas strings here: http://pandas.pydata.org/pandas-docs/stable/text.html#extracting-substrings (and also for searches).



## Split a String Column Into Multiple Columns
How can we take a text column and split it by a delimiter, like we can in Excel?  
To split a text column into several columns:

- `.str.split("delimiter", expand=True)` where `delimiter` specifies the delimiter between the values to split and `expand=True` puts the resulting pieces in new columns.
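
A minimal sketch of splitting with `expand=True`, using a made-up Series of comma-separated names:

```python
import pandas as pd

# Hypothetical "first,last" values, including one missing value
people = pd.Series(["Ada,Lovelace", "Alan,Turing", None])

# expand=True returns a DataFrame with one column per piece
parts = people.str.split(",", expand=True)

print(parts)
print(parts.columns.tolist())   # the new columns are labelled with integers
```

The missing value stays missing in every new column.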

## Rename 
- `.str.replace("string1", "string2")` : replace one string with another inside the values of a column
- `data.rename(index={}, columns={})` : rename rows (index) and columns in a dataframe

You can specify the old and new names like this: `{name_old1: name_new1, name_old2: name_new2, ...}`.

You can also apply functions to row (index) and column names like this: `index=function`, `columns=function`, or `data.rename(str.function, axis='index')` / `data.rename(str.function, axis='columns')`.

Check out the following examples.


In [4]:
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6


In [71]:
df.rename(index={0: "x", 1: "y", 2: "z"})


Unnamed: 0,A,B
x,1,4
y,2,5
z,3,6


In [72]:
df.rename(columns={"A": "a", "B": "c"})


Unnamed: 0,a,c
0,1,4
1,2,5
2,3,6


In [73]:
df.rename(index=str)


Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6


In [74]:
df.rename(str.lower, axis='columns')


Unnamed: 0,a,b
0,1,4
1,2,5
2,3,6


In [75]:
df.rename({1: 2, 2: 4}, axis='index')

Unnamed: 0,A,B
0,1,4
2,2,5
4,3,6
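
The examples above cover `rename`. For replacing text inside values, here is a small sketch on made-up data. It also contrasts `.str.replace`, which replaces part of each string, with the plain `Series.replace`, which by default only replaces values that match exactly:

```python
import pandas as pd

# Hypothetical descriptions
desc = pd.Series(["$300 &under", "over $300"])

# Replace a substring inside each value (regex=False: treat "&" literally)
partial = desc.str.replace("&", "and ", regex=False)

# Series.replace swaps whole values that are an exact match
whole = desc.replace("over $300", "above $300")

print(partial.tolist())
print(whole.tolist())
```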


## In-class exercises

### The Data 
We will work on the `chicago_crimes.csv` file, which contains data about reported crime incidents in Chicago.
You can find the dataset here : https://www.kaggle.com/chicago/chicago-crime.
The description of the columns contained in this data file is the following:

- unique_key : Unique identifier for the record.
- case_number : The Chicago Police Department RD Number (Records Division Number), which is unique to the incident.
- date : Date when the incident occurred. This is sometimes a best estimate.
- block : The partially redacted address where the incident occurred, placing it on the same block as the actual address.
- iucr : The Illinois Uniform Crime Reporting code. This is directly linked to the Primary Type and Description. See the list of IUCR codes at https://data.cityofchicago.org/d/c7ck-438e.
- primary_type : The primary description of the IUCR code.
- description : The secondary description of the IUCR code, a subcategory of the primary description.
- location_description : Description of the location where the incident occurred.
- arrest : Indicates whether an arrest was made.
- domestic : Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.
- beat : Indicates the beat where the incident occurred. A beat is the smallest police geographic area – each beat has a dedicated police beat car. Three to five beats make up a police sector, and three sectors make up a police district. The Chicago Police Department has 22 police districts. See the beats at https://data.cityofchicago.org/d/aerh-rz74.
- district : Indicates the police district where the incident occurred. See the districts at https://data.cityofchicago.org/d/fthy-xz3r.
- ward : The ward (City Council district) where the incident occurred. See the wards at https://data.cityofchicago.org/d/sp34-6z76.
- community_area : Indicates the community area where the incident occurred. Chicago has 77 community areas. See the community areas at https://data.cityofchicago.org/d/cauq-8yn6.
- fbi_code : Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS). See the Chicago Police Department listing of these classifications at http://gis.chicagopolice.org/clearmap_crime_sums/crime_types.html.
- x_coordinate : The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.
- y_coordinate : The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.
- year : Year the incident occurred.
- updated_on : Date and time the record was last updated.
- latitude : The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
- longitude : The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
- location : The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block.

### Import the libraries as we did in the past lectures

In [2]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

### Read the file chicago_crimes.csv

In [3]:
crimes = pd.read_csv("data/chicago_crimes.csv")

### Check the first values of the dataset

In [4]:
crimes.head(3)

Unnamed: 0,identification,Case Number,Date-Time,Date,Time,Block,Street,IUCR,Primary Type,Description,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Location,Latitude,Longitude
0,7446859,HS247325,1/1/08 0:01,1/1/08,0:01,004XX E 133RD ST,E 133RD ST,841,THEFT,FINANCIAL ID THEFT:$300 &UNDER,...,9,54,6,,,2008,4/30/10 1:15,,,
1,6236266,HP323693,1/1/08 0:01,1/1/08,0:01,013XX E 49TH ST,E 49TH ST,840,THEFT,FINANCIAL ID THEFT: OVER $300,...,4,39,6,1186146.0,1872823.0,2008,5/24/08 1:05,"(41.80615476732223, -87.59280284925518)",41.806155,-87.592803
2,7514546,HS317259,1/1/08 0:01,1/1/08,0:01,014XX E 59TH ST,E 59TH ST,840,THEFT,FINANCIAL ID THEFT: OVER $300,...,5,41,6,,,2008,5/24/10 1:12,,,


We will work on one of the columns that contain string data: `Primary Type`, which tells us the type of crime that was committed.

### Compute the number of occurrences of each category in this column.

In [5]:
crimes['Primary Type'].value_counts().head(3)

THEFT              605
BATTERY            358
CRIMINAL DAMAGE    187
Name: Primary Type, dtype: int64

### Fixing Case, Spacing, Etc. on String Columns<a name="_fixing case, spacing, etc. on string columns"></a>

#### Transform the values in `Primary Type` to title case using `title()`

In [6]:
crimes['Primary Type'] = crimes['Primary Type'].str.title()
crimes.head(3)

Unnamed: 0,identification,Case Number,Date-Time,Date,Time,Block,Street,IUCR,Primary Type,Description,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Location,Latitude,Longitude
0,7446859,HS247325,1/1/08 0:01,1/1/08,0:01,004XX E 133RD ST,E 133RD ST,841,Theft,FINANCIAL ID THEFT:$300 &UNDER,...,9,54,6,,,2008,4/30/10 1:15,,,
1,6236266,HP323693,1/1/08 0:01,1/1/08,0:01,013XX E 49TH ST,E 49TH ST,840,Theft,FINANCIAL ID THEFT: OVER $300,...,4,39,6,1186146.0,1872823.0,2008,5/24/08 1:05,"(41.80615476732223, -87.59280284925518)",41.806155,-87.592803
2,7514546,HS317259,1/1/08 0:01,1/1/08,0:01,014XX E 59TH ST,E 59TH ST,840,Theft,FINANCIAL ID THEFT: OVER $300,...,5,41,6,,,2008,5/24/10 1:12,,,


#### Transform the values in `Description` to lower case using `lower()`

In [7]:
crimes['Description'] = crimes['Description'].str.lower()
crimes['Description'].head()

0    financial id theft:$300 &under
1     financial id theft: over $300
2     financial id theft: over $300
3     financial id theft: over $300
4           harassment by telephone
Name: Description, dtype: object

### Search for String Values<a name="_search for string values"></a>

#### Find all the rows where the description contains the word "theft"

In [8]:
crimes['Description'].str.contains("theft").head()

0     True
1     True
2     True
3     True
4    False
Name: Description, dtype: bool

#### Use this match to subset the crimes dataframe to those matching rows. Put the result in a variable called crimes_with_theft.

In [9]:
crimes_with_theft = crimes[crimes['Description'].str.contains("theft")]   
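
The subsetting above works when the result of `str.contains` holds only `True` and `False`. If the column has missing values, `str.contains` returns a missing value for those rows, and such a mask cannot be used for filtering. The sketch below, on a small made-up dataframe (`toy` is a hypothetical name), passes `na=False` so that missing descriptions are treated as non-matching:

```python
import pandas as pd

# Hypothetical miniature version of the crimes table
toy = pd.DataFrame({
    "Primary Type": ["Theft", "Battery", "Theft"],
    "Description": ["retail theft", None, "over $300"],
})

# na=False turns the missing description into False in the mask
mask = toy["Description"].str.contains("theft", na=False)
toy_with_theft = toy[mask]

print(toy_with_theft)
```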


#### Show the `Primary Type` and the `Description` columns

In [10]:
crimes_with_theft[['Primary Type','Description']]

Unnamed: 0,Primary Type,Description
0,Theft,financial id theft:$300 &under
1,Theft,financial id theft: over $300
2,Theft,financial id theft: over $300
3,Theft,financial id theft: over $300
5,Theft,financial id theft:$300 &under
7,Theft,financial id theft: over $300
9,Theft,financial id theft: over $300
10,Theft,financial id theft: over $300
12,Theft,financial id theft:$300 &under
14,Theft,financial id theft: over $300


#### Count how many of each theft description type there are, using `value_counts()`:

In [11]:
#First get the rows where the `Description` contains "theft" : `crimes[crimes['Description'].str.contains("theft")]` 
#then access the `Description` column by adding `['Description']` then use `value_counts()` to count the values 
#of types.
crimes[crimes['Description'].str.contains("theft")]['Description'].value_counts()

financial id theft: over $300       247
financial id theft:$300 &under       68
retail theft                         24
agg: financial id theft              13
attempt financial identity theft     10
theft of labor/services               9
theft/recovery: automobile            8
theft of lost/mislaid prop            2
theft by lessee,non-veh               1
Name: Description, dtype: int64

#### Count the number of money thefts by searching for the \$ sign.

In [12]:
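# "$" is a special character in regular expressions, so escape it; the raw string (r'...') keeps the backslash as-is
crimes[crimes['Description'].str.contains(r'\$')]['Description'].value_counts()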

financial id theft: over $300     247
$300 and under                     82
financial id theft:$300 &under     68
over $300                          60
over $500                           1
Name: Description, dtype: int64
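
The backslash is needed because `str.contains` interprets its pattern as a regular expression, where `$` means "end of the string". A sketch on made-up values shows the difference, along with `regex=False` as an alternative that searches for the literal character:

```python
import pandas as pd

# Hypothetical descriptions
desc = pd.Series(["over $300", "retail theft", "$300 and under"])

escaped = desc.str.contains(r"\$")              # regex with the "$" escaped
literal = desc.str.contains("$", regex=False)   # plain text search for "$"
unescaped = desc.str.contains("$")              # regex "end of string": always found

print(escaped.tolist())
print(unescaped.tolist())
```

The unescaped version silently matches every row, which is why the escape (or `regex=False`) matters here.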

### Split a String Column Into Multiple Columns<a name="_split a string column into multiple columns"></a>

#### Split the Location column by the comma and put the results in new columns

In [13]:
crimes['Location'].head()

0                                        NaN
1    (41.80615476732223, -87.59280284925518)
2                                        NaN
3     (41.78228592790556, -87.5908189949899)
4    (41.76318136897587, -87.58973074990949)
Name: Location, dtype: object

In [14]:
crimes['Location'].str.split(",", expand=True)

Unnamed: 0,0,1
0,,
1,(41.80615476732223,-87.59280284925518)
2,,
3,(41.78228592790556,-87.5908189949899)
4,(41.76318136897587,-87.58973074990949)
5,(41.75955804915209,-87.58706033233304)
6,(41.75878806614018,-87.57425219676995)
7,(41.75219820520971,-87.55202598677165)
8,,
9,(41.90429535973555,-87.68844017387774)


#### Check the names of the resulting columns. What do you notice?

In [24]:
# the columns are actually named with integers, not strings
crimes['Location'].str.split(",", expand=True).head()


Unnamed: 0,0,1
0,,
1,(41.80615476732223,-87.59280284925518)
2,,
3,(41.78228592790556,-87.5908189949899)
4,(41.76318136897587,-87.58973074990949)


#### Put the split results in a dataframe called `lat_lon` . 

In [17]:
lat_lon = crimes['Location'].str.split(",", expand=True)

#### Rename the resulting columns "Lat" and "Lon" and transform the row indices to strings.

In [18]:
lat_lon = lat_lon.rename(index=str, columns={0:"Lat", 1:"Lon"})

In [19]:
lat_lon.head()

Unnamed: 0,Lat,Lon
0,,
1,(41.80615476732223,-87.59280284925518)
2,,
3,(41.78228592790556,-87.5908189949899)
4,(41.76318136897587,-87.58973074990949)


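#### Remove the unwanted parentheses, use `replace()` to get rid of them:

In [20]:
# regex=False treats the parenthesis as a literal character, not as regex syntax
lat_lon['Lat'] = lat_lon['Lat'].str.replace("(", "", regex=False)

In [21]:
lat_lon['Lon'] = lat_lon['Lon'].str.replace(")", "", regex=False)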

In [22]:
lat_lon.head()

Unnamed: 0,Lat,Lon
0,,
1,41.80615476732223,-87.59280284925518
2,,
3,41.78228592790556,-87.5908189949899
4,41.76318136897587,-87.58973074990949
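
After this cleaning the values are still strings, and `Lon` keeps the space that followed the comma. As a possible next step (an assumption about what you may want, not part of the exercise), the sketch below strips the parentheses and whitespace with `str.strip` and converts the result to numbers with `pd.to_numeric`, using made-up coordinates:

```python
import pandas as pd

# Hypothetical location strings in the same "(lat, lon)" format
loc = pd.Series(["(41.8, -87.5)", None, "(41.7, -87.6)"])

# str.strip("()") removes leading and trailing parentheses in one call
parts = loc.str.strip("()").str.split(",", expand=True)

lat = pd.to_numeric(parts[0].str.strip())   # strings -> floats
lon = pd.to_numeric(parts[1].str.strip())   # .str.strip() drops the leading space

print(lat.tolist(), lon.tolist())
```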


#### What if we want to add that to the crimes dataframe?<a name="_ what if we want to add that to the crimes dataframe?"></a>

You can just set new columns in the crimes dataframe equal to the ones in this small new dataframe (you can give them different names in crimes if you want):

In [23]:
type(lat_lon['Lat'])

pandas.core.series.Series

- Check the index of `lat_lon` dataframe. What do you notice?

In [51]:
lat_lon.index

Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
       ...
       '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998',
       '1999'],
      dtype='object', length=2000)

- Check the index of crimes dataframe. What do you notice?

In [54]:
# Notice this index is ints, not strings. they don't match.
crimes.index

RangeIndex(start=0, stop=2000, step=1)

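Because their indices are different, we can't simply set one equal to the other as columns. Pandas aligns on index labels, and none of the string labels in `lat_lon` match the integer labels in `crimes`, so the new column would be filled with NaN. We have to extract the values from `lat_lon` and use those: `.values` returns a plain array, which is assigned by position. Other label-based methods like `pd.concat` will not line the rows up either.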
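
The effect of mismatched indices can be reproduced on a tiny made-up example (`left` and `right` are hypothetical names). It also sketches an alternative to `.values`: converting the string index back to integers so that the labels match again.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2, 3]})                       # integer index 0, 1, 2
right = pd.Series(["x", "y", "z"], index=["0", "1", "2"])   # string index

left["aligned"] = right              # labels never match, so every value is NaN
left["by_position"] = right.values   # plain array, assigned row by row

# Alternative: make the labels match, then assign normally
right.index = right.index.astype(int)
left["fixed_index"] = right

print(left)
```

Using `.values` relies on both objects having the same number of rows in the same order, which holds here because `lat_lon` was built directly from `crimes`.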

- Extract the values of the `Lat` and `Lon` columns from `lat_lon` dataframe using `.values` and put that in `crimes` dataframe under `Lat` and `Lon` columns.

In [55]:
crimes['Lat'] = lat_lon['Lat'].values

In [56]:
crimes.head()

Unnamed: 0,identification,Case Number,Date-Time,Date,Time,Block,Street,IUCR,Primary Type,Description,...,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Location,Latitude,Longitude,Lat
0,7446859,HS247325,1/1/08 0:01,1/1/08,0:01,004XX E 133RD ST,E 133RD ST,841,Theft,financial id theft:$300 &under,...,54,6,,,2008,4/30/10 1:15,,,,
1,6236266,HP323693,1/1/08 0:01,1/1/08,0:01,013XX E 49TH ST,E 49TH ST,840,Theft,financial id theft: over $300,...,39,6,1186146.0,1872823.0,2008,5/24/08 1:05,"(41.80615476732223, -87.59280284925518)",41.806155,-87.592803,41.80615476732223
2,7514546,HS317259,1/1/08 0:01,1/1/08,0:01,014XX E 59TH ST,E 59TH ST,840,Theft,financial id theft: over $300,...,41,6,,,2008,5/24/10 1:12,,,,
3,6422569,HP499242,1/1/08 0:01,1/1/08,0:01,014XX E 62ND ST,E 62ND ST,840,Theft,financial id theft: over $300,...,42,6,1186762.0,1864130.0,2008,8/17/08 1:04,"(41.78228592790556, -87.5908189949899)",41.782286,-87.590819,41.78228592790556
4,6013347,HP118097,1/1/08 0:01,1/1/08,0:01,014XX E 72ND PL,E 72ND PL,2825,Other Offense,harassment by telephone,...,43,26,1187119.0,1857171.0,2008,1/16/08 1:05,"(41.76318136897587, -87.58973074990949)",41.763181,-87.589731,41.76318136897587


#### Now split the `Location` column but don't put the result in new columns.

In [57]:
crimes['Location'].str.split(",").head()

0                                           NaN
1    [(41.80615476732223,  -87.59280284925518)]
2                                           NaN
3     [(41.78228592790556,  -87.5908189949899)]
4    [(41.76318136897587,  -87.58973074990949)]
Name: Location, dtype: object

In [28]:
crimes['Location'].str.split(",")[1]

['(41.80615476732223', ' -87.59280284925518)']

Each non-missing value is now a list of strings.
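
When the pieces are kept as lists, individual elements can still be pulled out for the whole column with `.str[...]` or `.str.get(...)`. A minimal sketch on made-up locations:

```python
import pandas as pd

# Hypothetical location strings, including one missing value
loc = pd.Series(["(41.8, -87.5)", None])

pieces = loc.str.split(",")    # each value becomes a list of strings
first = pieces.str[0]          # first element of every list
second = pieces.str.get(1)     # second element of every list

print(first.tolist())
print(second.tolist())
```

Missing values stay missing, so no special handling is needed for rows without a location.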