# Part 3 - Data Preparation - Scratchpad
Useful stuff from this section

## 3.1 - Cleaning Data
These examples offer hints and common tricks for cleaning data. Use this notebook as a point of reference if you run into problems.

In [1]:
import pandas as pd

### Coping with White space
Strings often start or end with unwanted whitespace. To remove it we can use:
 - `strip()`
 - `lstrip()`
 - `rstrip()`.
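
As a minimal sketch of how the three methods differ, the cell below applies each one to a small Series (the `codes` Series is made up for illustration and mirrors the course codes used in this section):

```python
import pandas as pd

# Illustrative data: one clean code, one with trailing and one with surrounding whitespace
codes = pd.Series(['TM351', 'TU100 ', ' M269 '])

# lstrip() removes leading whitespace only, rstrip() removes trailing whitespace only,
# and strip() removes both
left_stripped = codes.str.lstrip()
right_stripped = codes.str.rstrip()
both_stripped = codes.str.strip()

# join with underscores so any remaining spaces are visible
print('_'.join(left_stripped))
print('_'.join(right_stripped))
print('_'.join(both_stripped))
```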

In [2]:
coursedata_df = pd.DataFrame({'coursecode': ['TM351', 'TU100 ', ' M269 '],
                             'points': [30, 60, 30],
                             'level': ['3', '1', '2']
                             })
# pull out the coursecodes, then join them with underscores to show the spaces.
'_'.join(coursedata_df['coursecode'])

'TM351_TU100 _ M269 '

In [3]:
# strip all whitespace using strip
'_'.join(coursedata_df['coursecode'].str.strip())

'TM351_TU100_M269'

In [7]:
# or strip them all in place by reassigning the column
coursedata_df['coursecode'] = coursedata_df['coursecode'].str.strip()
'_'.join(coursedata_df['coursecode'])

'TM351_TU100_M269'

### Coping with case
We can use string methods to change the case of elements in a column:

 - `str.upper()`
 - `str.lower()`

In [8]:
coursedata_df['coursecode'].str.lower()

0    tm351
1    tu100
2     m269
Name: coursecode, dtype: object

In [9]:
coursedata_df.coursecode.str.upper()

0    TM351
1    TU100
2     M269
Name: coursecode, dtype: object

### Type casting
If necessary, we can cast a DataFrame column to another type using the `astype()` method.

In [10]:
# Check the datatype of each column
coursedata_df.dtypes

coursecode    object
level         object
points         int64
dtype: object

In [11]:
# recast the level and points values as 64-bit floating point numbers
coursedata_df[['level', 'points']] = coursedata_df[['level', 'points']].astype(float)
coursedata_df.dtypes

coursecode     object
level         float64
points        float64
dtype: object

If you need to cast a Series or DataFrame column to a numeric type, but there are likely to be some elements that aren't castable and need replacing with `NaN` (the not-a-number marker), use `pd.to_numeric()` with the `errors='coerce'` parameter to generate `NaN` for those values.
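
A minimal sketch of this, assuming a made-up `raw_points` Series in which one entry is not a number:

```python
import pandas as pd

# Illustrative data: 'unknown' cannot be interpreted as a number
raw_points = pd.Series(['30', '60', 'unknown', '15.5'])

# errors='coerce' replaces unparseable values with NaN instead of raising an error
points = pd.to_numeric(raw_points, errors='coerce')
print(points)
```

Because `NaN` is a floating point value, the resulting Series has a float dtype even though some of the inputs look like integers.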

### Rounding numbers
Calling `round(value, precision)` rounds `value` to the number of decimal places given by `precision`.

In [13]:
# round to 2 d.p.
round(157248.22334673467, 2)

157248.22

If precision is not specified, `round()` rounds to the nearest whole number.

In [15]:
# round to an integer value
round(157248.22334673467)

157248

In [22]:
# round to the nearest thousand (10^3).
round(157248.22334673467, -3)

157000.0
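
The same idea can be applied to a whole column at once. The sketch below uses the `round()` method of a *pandas* Series on some made-up values. It also shows one thing to watch for with the built-in `round()` in Python 3: values exactly halfway between two integers are rounded to the nearest even integer.

```python
import pandas as pd

# Illustrative values only
values = pd.Series([157248.22334673467, 1234.5678])

# round every element to 2 decimal places
two_dp = values.round(2)

# a negative precision rounds to the left of the decimal point (here, to the nearest thousand)
thousands = values.round(-3)

print(two_dp)
print(thousands)

# Python 3 built-in round(): ties go to the nearest even integer
print(round(2.5), round(3.5))
```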

### Splitting data from one column to multiple columns
Example from [Pandas Dataframe: split column into multiple columns, right-align inconsistent cell entries](http://stackoverflow.com/questions/23317352/pandas-dataframe-split-column-into-multiple-columns-right-align-inconsistent-c). Link doesn't work.

In [27]:
addresses_df = pd.DataFrame({'City, State, Country': ['HUN', 'ESP', 'GBR', 'ESP', 'FRA',' ID, USA', 'GA, USA', 'Hoboken, NJ, USA', 'NJ, USA', 'AUS']})

In [28]:
addresses_df

Unnamed: 0,"City, State, Country"
0,HUN
1,ESP
2,GBR
3,ESP
4,FRA
5,"ID, USA"
6,"GA, USA"
7,"Hoboken, NJ, USA"
8,"NJ, USA"
9,AUS


We want to reshape this as three columns in a DataFrame: one column each for country, state and city.

In [29]:
# split cell entries on the comma character, stripping stray whitespace from each part
columnsplitter = lambda x: pd.Series([part.strip() for part in x.split(',')])
splitaddresses_df = addresses_df['City, State, Country'].apply(columnsplitter)
splitaddresses_df

Unnamed: 0,0,1,2
0,HUN,,
1,ESP,,
2,GBR,,
3,ESP,,
4,FRA,,
5,ID,USA,
6,GA,USA,
7,Hoboken,NJ,USA
8,NJ,USA,
9,AUS,,


In [31]:
# Split each cell entry on the comma, reverse the split list and assign the parts to new Series columns.
splitter = lambda x: pd.Series([part.strip() for part in reversed(x.split(','))])
splitaddresses_df = addresses_df['City, State, Country'].apply(splitter)
splitaddresses_df

Unnamed: 0,0,1,2
0,HUN,,
1,ESP,,
2,GBR,,
3,ESP,,
4,FRA,,
5,USA,ID,
6,USA,GA,
7,USA,NJ,Hoboken
8,USA,NJ,
9,AUS,,


In [33]:
# now rename the columns
splitaddresses_df.rename(columns = {0:'Country',1:'State',2:'City'}, inplace=True)
splitaddresses_df

Unnamed: 0,Country,State,City
0,HUN,,
1,ESP,,
2,GBR,,
3,ESP,,
4,FRA,,
5,USA,ID,
6,USA,GA,
7,USA,NJ,Hoboken
8,USA,NJ,
9,AUS,,
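
For the first, left-aligned split, an alternative to `apply()` is the vectorised `str.split()` method with `expand=True`. This is only a sketch on a shortened, made-up copy of the address data, and it does not do the reversal step:

```python
import pandas as pd

# Shortened illustrative copy of the address column
addresses = pd.Series(['HUN', ' ID, USA', 'Hoboken, NJ, USA'])

# expand=True returns a DataFrame with one column per split part;
# rows with fewer parts are padded with missing values
split_df = addresses.str.split(',', expand=True)

# strip the whitespace left around each part, column by column
split_df = split_df.apply(lambda col: col.str.strip())
print(split_df)
```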


### Recognising and Parsing time formats

For general information on handling time in *pandas*, see the *pandas* documentation: [Time Series / Date functionality](http://pandas.pydata.org/pandas-docs/stable/timeseries.html).


In [38]:
timedata_df = pd.DataFrame(
            {
                'item':['A','B','C'],
                'date':['12-5-12','30-08-11','17-10-10'],
                'datetime':['May 7, 2010, 11.14', 'April 22, 2011, 22.06', 'October 7, 2013, 00.01']
            })
timedata_df[['item', 'datetime', 'date']]

Unnamed: 0,item,datetime,date
0,A,"May 7, 2010, 11.14",12-5-12
1,B,"April 22, 2011, 22.06",30-08-11
2,C,"October 7, 2013, 00.01",17-10-10


In [41]:
# cast a column by specifying the way the datetime string element is formatted

# in this case we parse a date
pd.to_datetime(timedata_df['date'], format="%d-%m-%y")

0   2012-05-12
1   2011-08-30
2   2010-10-17
Name: date, dtype: datetime64[ns]

In [42]:
# cast a date and time
pd.to_datetime(timedata_df.datetime, format='%B %d, %Y, %H.%M')

0   2010-05-07 11:14:00
1   2011-04-22 22:06:00
2   2013-10-07 00:01:00
Name: datetime, dtype: datetime64[ns]

Some common datetime format elements are:

    %a - The abbreviated weekday name (e.g. 'Sun')
    %A - The  full  weekday  name (e.g. 'Sunday')
    %b - The abbreviated month name (e.g. 'Jan')
    %B - The  full  month  name (e.g. 'January')
    %d - Day of the month (01..31)
    %H - Hour of the day, 24-hour clock (00..23)
    %I - Hour of the day, 12-hour clock (01..12)
    %j - Day of the year (001..366)
    %m - Month of the year (01..12)
    %M - Minute of the hour (00..59)
    %p - Meridian indicator (e.g. 'AM' or 'PM')
    %S - Second of the minute (00..60)
    %U - Week number of the current year, starting with the first Sunday as the first day of the first week (00..53)
    %W - Week number of the current year, starting with the first Monday as the first day of the first week (00..53)
    %w - Day of the week (Sunday is 0, 0..6)
    %y - Year without a century (00..99)
    %Y - Year with century (e.g. 2015)

For a full list of time-related codes, see [Python's strftime directives](http://strftime.org/).
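
The same format codes work in the other direction, for turning datetimes back into strings. A small sketch, using made-up date strings and only numeric codes so the output does not depend on the locale:

```python
import pandas as pd

# Illustrative day-month-year strings
dates = pd.to_datetime(pd.Series(['12-05-12', '30-08-11']), format='%d-%m-%y')

# dt.strftime() formats each datetime using the same codes as the list above
labels = dates.dt.strftime('%Y/%m/%d (day %j)')
print(labels.tolist())
```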

If a string does not match the format, an error will be thrown. Unparseable values can instead be *coerced* into the `NaT` (not-a-time) value by setting `errors='coerce'`.

In [51]:
# Create a DataFrame containing something that is not a date.
timedata2_df = pd.DataFrame({
                'item': ['A','B','C'],
                'date':['66-65-64', '30-08-11', '17-10-10'],
                })
# timedata2_df
pd.to_datetime(timedata2_df.date, format='%d-%m-%y', errors='coerce')

0          NaT
1   2011-08-30
2   2010-10-17
Name: date, dtype: datetime64[ns]
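
Once the bad values have been coerced to `NaT`, it can be useful to see which original strings failed to parse. One possible way, sketched on the same date strings as above, is to filter the raw column with `isna()`:

```python
import pandas as pd

raw_dates = pd.Series(['66-65-64', '30-08-11', '17-10-10'])
parsed = pd.to_datetime(raw_dates, format='%d-%m-%y', errors='coerce')

# keep only the original strings whose parsed value is NaT
unparsed = raw_dates[parsed.isna()]
print(unparsed)
```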

### Glimpse of Regex

#### Cleaning numeric strings

In [52]:
messynumbers_df = pd.DataFrame({
    'messyvals':['£40000','UKP 25,000','25000 pounds Sterling']
})
# first remove any commas:
messynumbers_df['cleanvals'] = messynumbers_df['messyvals'].str.replace(',','')
messynumbers_df


Unnamed: 0,messyvals,cleanvals
0,£40000,£40000
1,"UKP 25,000",UKP 25000
2,25000 pounds Sterling,25000 pounds Sterling


In [54]:
# apply a regex to get rid of non-numeric characters to the left and right of the digits we want to keep
messynumbers_df.replace(
        {'cleanvals': r"^[^\d]*([\d]*)[^\d]*$"},
        {'cleanvals': r'\1'}, regex=True)

Unnamed: 0,messyvals,cleanvals
0,£40000,40000
1,"UKP 25,000",25000
2,25000 pounds Sterling,25000
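
Note that the cleaned values are still strings. A possible alternative, sketched here on the same messy values, is to pull out the first run of digits with `str.extract()` and then cast the result to integers:

```python
import pandas as pd

messy = pd.Series(['£40000', 'UKP 25,000', '25000 pounds Sterling'])

# remove the commas (as a literal string, not a regex), then capture the first run of digits
digits = messy.str.replace(',', '', regex=False).str.extract(r'(\d+)', expand=False)

# cast the digit strings to integers so they can be used in calculations
amounts = digits.astype(int)
print(amounts)
```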


#### Extracting elements from a string

In [55]:
urls_df = pd.DataFrame({'url':['http://this.example.com/path/file.html',
                              'http://another.example.com/longer/path/file.json']})
urls_df


Unnamed: 0,url
0,http://this.example.com/path/file.html
1,http://another.example.com/longer/path/file.json


In [66]:
# create new columns for each of the extracts, based on the original
urls_df['domain'] = urls_df['url']
urls_df['filetype'] = urls_df['url']

# pull out the first item: find everything between http:// and the next /
urls_df.replace({'domain': r"http://([^/]*).*$"}, {'domain': r'\1'}, regex=True, inplace=True)

urls_df

Unnamed: 0,url,domain,filetype
0,http://this.example.com/path/file.html,this.example.com,http://this.example.com/path/file.html
1,http://another.example.com/longer/path/file.json,another.example.com,http://another.example.com/longer/path/file.json


In [67]:
# we can extend the same regex to get the filetype
urls_df.replace({'filetype': r"^http://([^/]*).*\.([^\.]*)$"}, {'filetype': r'\2'}, regex=True, inplace=True)

urls_df

Unnamed: 0,url,domain,filetype
0,http://this.example.com/path/file.html,this.example.com,html
1,http://another.example.com/longer/path/file.json,another.example.com,json


We can also refer to both extracted values in the replacement string and reorder them:

In [69]:
urls_df.url.replace(r"^http://([^/]*).*\.([^\.]*)$", r'We got a(n) \2 file from \1', regex=True)

0       We got a(n) html file from this.example.com
1    We got a(n) json file from another.example.com
Name: url, dtype: object
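
As a sketch of another route to the same result, `str.extract()` with named groups returns the captured parts directly as new columns, so there is no need to copy the `url` column first. The group names `domain` and `filetype` are chosen here to match the column names used above:

```python
import pandas as pd

urls = pd.Series(['http://this.example.com/path/file.html',
                  'http://another.example.com/longer/path/file.json'])

# each named group (?P<name>...) becomes a column in the result
pattern = r'^http://(?P<domain>[^/]*).*\.(?P<filetype>[^.]*)$'
parts = urls.str.extract(pattern)
print(parts)
```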

## 3.2 - Selecting, Projecting, sorting and limiting

### Basic Manipulation of tabular data structures
We will use *pandas* and *pandasSQL*. However, pandasSQL, being based on SQLite, __cannot handle spaces in column names__.
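
If a DataFrame does have spaces in its column names, one possible workaround is to rename the columns before querying. The sketch below uses a made-up `survey_df` and replaces each space with an underscore:

```python
import pandas as pd

# Hypothetical DataFrame whose column names contain spaces
survey_df = pd.DataFrame({'course code': ['TM351'], 'credit points': [30]})

# rename() accepts a function that is applied to every column name
sql_ready_df = survey_df.rename(columns=lambda name: name.strip().replace(' ', '_'))
print(list(sql_ready_df.columns))
```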

In [1]:
import pandas as pd

### DataFrames used in this section

In [5]:
# create the coursedata DF
courseData = {'coursecode':['TM351','TU100','M269'],
             'points':[30,60,30],
             'level':['3','2','1']
             }
course_df = pd.DataFrame(courseData)
course_df

Unnamed: 0,coursecode,level,points
0,TM351,3,30
1,TU100,2,60
2,M269,1,30


In [6]:
# create the ABCD DF:
ABCD = {'A':['a1','a2','a9'],
        'B':['b1','b4','b5'],
        'C':['c1','c7','c7'],
        'D':['c1','d9','d7']
       }
ABCD_df = pd.DataFrame(ABCD)
ABCD_df

Unnamed: 0,A,B,C,D
0,a1,b1,c1,c1
1,a2,b4,c7,d9
2,a9,b5,c7,d7


### Set up pandasSQL

In [7]:
# import the sqldf function from pandasql
from pandasql import sqldf

# make a simple wrapper function to supply the query 
pysqldf = lambda q: sqldf(q, globals())

In [8]:
# create and apply an SQL query
query = '''SELECT * FROM course_df;'''
# pass it through our wrapper function
pysqldf(query)

Unnamed: 0,coursecode,level,points
0,TM351,3,30
1,TU100,2,60
2,M269,1,30


### PROJECTION: choosing certain columns

#### Projection using _pandas_

In [10]:
result_df = ABCD_df[['A', 'C']]

# change the order
result_df = ABCD_df[['C','A']]

result_df

Unnamed: 0,C,A
0,c1,a1
1,c7,a2
2,c7,a9


#### Projection using SQL

In [12]:
# query = '''SELECT A, C FROM ABCD_df;'''
# result_df = pysqldf(query)

# change order of columns
query = '''SELECT C, A FROM ABCD_df;'''
result_df = pysqldf(query)
result_df

Unnamed: 0,C,A
0,c1,a1
1,c7,a2
2,c7,a9


In [13]:
# rename columns as well as select 
query = '''SELECT A, B AS Bcolumn, C AS othercolumn FROM ABCD_df;'''
result_df = pysqldf(query)
result_df

Unnamed: 0,A,Bcolumn,othercolumn
0,a1,b1,c1
1,a2,b4,c7
2,a9,b5,c7
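
For comparison, a rough *pandas* equivalent of selecting and renaming columns with `AS` is to project the columns and then call `rename()`. This sketch rebuilds `ABCD_df` so that it runs on its own:

```python
import pandas as pd

ABCD_df = pd.DataFrame({'A': ['a1', 'a2', 'a9'],
                        'B': ['b1', 'b4', 'b5'],
                        'C': ['c1', 'c7', 'c7'],
                        'D': ['c1', 'd9', 'd7']})

# project columns A, B and C, then rename B and C (like B AS Bcolumn, C AS othercolumn)
result_df = ABCD_df[['A', 'B', 'C']].rename(columns={'B': 'Bcolumn', 'C': 'othercolumn'})
print(result_df)
```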


### SELECTION: choosing certain rows

#### Selection using _pandas_

In [14]:
result_df = course_df[course_df['points']==30]
result_df

Unnamed: 0,coursecode,level,points
0,TM351,3,30
2,M269,1,30


In [16]:
# select and project over chosen columns
result_df = course_df[course_df['points']==30][['coursecode','level']]
result_df

Unnamed: 0,coursecode,level
0,TM351,3
2,M269,1


In [19]:
# Multiple conditions
result_df = ABCD_df[((ABCD_df['B']=='b1') | (ABCD_df['B']=='b4')) & (ABCD_df['C']=='c7')]
result_df

Unnamed: 0,A,B,C,D
1,a2,b4,c7,d9
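
When a column is being tested against several alternative values, `isin()` can replace a chain of `|` conditions. A minimal sketch of the same selection, with `ABCD_df` rebuilt so that the cell is self-contained:

```python
import pandas as pd

ABCD_df = pd.DataFrame({'A': ['a1', 'a2', 'a9'],
                        'B': ['b1', 'b4', 'b5'],
                        'C': ['c1', 'c7', 'c7'],
                        'D': ['c1', 'd9', 'd7']})

# rows where B is either 'b1' or 'b4', and C is 'c7'
result_df = ABCD_df[ABCD_df['B'].isin(['b1', 'b4']) & (ABCD_df['C'] == 'c7')]
print(result_df)
```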


#### Selection using SQL

In [20]:
query = ''' SELECT A,B,C,D
            FROM ABCD_df
            WHERE C='c7';
        '''
result_df = pysqldf(query)
result_df

Unnamed: 0,A,B,C,D
0,a2,b4,c7,d9
1,a9,b5,c7,d7


In [21]:
# we can make WHERE as complex as we need it to be
query = ''' SELECT *
            FROM course_df
            WHERE points = 30 AND level = '1';
        '''
result_df = pysqldf(query)
result_df

Unnamed: 0,coursecode,level,points
0,M269,1,30


In [25]:
# SQL OR condition
query = ''' SELECT *
            FROM ABCD_df
            WHERE B = 'b1' OR B = 'b4';
        '''
result_df = pysqldf(query)
result_df

Unnamed: 0,A,B,C,D
0,a1,b1,c1,c1
1,a2,b4,c7,d9


### Limiting the number of rows displayed

In [26]:
course_df.head(2)

Unnamed: 0,coursecode,level,points
0,TM351,3,30
1,TU100,2,60


In [28]:
# pandas also supports slicing
print(course_df[0:2])
print(course_df[:4])
print(course_df[2:])

  coursecode level  points
0      TM351     3      30
1      TU100     2      60
  coursecode level  points
0      TM351     3      30
1      TU100     2      60
2       M269     1      30
  coursecode level  points
2       M269     1      30
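
Two related options, sketched below on a rebuilt copy of `course_df`, are `iloc` for explicitly position-based slicing and `tail()` for the last rows:

```python
import pandas as pd

course_df = pd.DataFrame({'coursecode': ['TM351', 'TU100', 'M269'],
                          'points': [30, 60, 30],
                          'level': ['3', '2', '1']})

# iloc always slices by row position, whatever the index labels are
first_two = course_df.iloc[0:2]

# tail(n) returns the last n rows
last_one = course_df.tail(1)

print(first_two)
print(last_one)
```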


#### SQL

In [29]:
query = ''' SELECT *
            FROM course_df
            LIMIT 2;
        '''
result_df = pysqldf(query)
result_df

Unnamed: 0,coursecode,level,points
0,TM351,3,30
1,TU100,2,60


### SORTING the rows displayed

#### Pandas


In [32]:
result_df = course_df.sort_values(by='coursecode')
result_df

Unnamed: 0,coursecode,level,points
2,M269,1,30
0,TM351,3,30
1,TU100,2,60


In [33]:
result_df = course_df.sort_values(by='coursecode', ascending=False)
result_df

Unnamed: 0,coursecode,level,points
1,TU100,2,60
0,TM351,3,30
2,M269,1,30


In [34]:
# Multi - column sorting
result_df = course_df.sort_values(by=['points','level'], ascending=[True, False])
result_df

Unnamed: 0,coursecode,level,points
0,TM351,3,30
2,M269,1,30
1,TU100,2,60
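
One thing to keep in mind is that `level` holds strings here, so it is sorted as text rather than as numbers. With single digits the result is the same, but the sketch below (using made-up course codes and levels) shows how the two orderings can differ, and one way to sort numerically by casting first:

```python
import pandas as pd

# Hypothetical data with a two-digit level stored as a string
levels_df = pd.DataFrame({'coursecode': ['X1', 'X2', 'X3'],
                          'level': ['10', '2', '1']})

# sorted as text: '10' comes before '2'
as_text = levels_df.sort_values(by='level')

# cast to int first to sort numerically
as_number = levels_df.assign(level=levels_df['level'].astype(int)).sort_values(by='level')

print(as_text)
print(as_number)
```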


#### SQL

In [35]:
query = ''' SELECT *
            FROM course_df
            ORDER BY coursecode;
        '''
result_df = pysqldf(query)
result_df

Unnamed: 0,coursecode,level,points
0,M269,1,30
1,TM351,3,30
2,TU100,2,60


In [36]:
query = ''' SELECT *
            FROM course_df
            ORDER BY coursecode DESC;
        '''
result_df = pysqldf(query)
result_df

Unnamed: 0,coursecode,level,points
0,TU100,2,60
1,TM351,3,30
2,M269,1,30


In [37]:
query = ''' SELECT *
            FROM course_df
            ORDER BY points ASC, coursecode DESC;
        '''
result_df = pysqldf(query)
result_df

Unnamed: 0,coursecode,level,points
0,TM351,3,30
1,M269,1,30
2,TU100,2,60
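
A rough *pandas* equivalent of this last query is sketched below. The SQL results above are numbered from 0 again, whereas `sort_values()` keeps the original index labels, so `reset_index(drop=True)` is added here to get the same numbering:

```python
import pandas as pd

course_df = pd.DataFrame({'coursecode': ['TM351', 'TU100', 'M269'],
                          'points': [30, 60, 30],
                          'level': ['3', '2', '1']})

# ORDER BY points ASC, coursecode DESC, then renumber the rows from 0
result_df = (course_df
             .sort_values(by=['points', 'coursecode'], ascending=[True, False])
             .reset_index(drop=True))
print(result_df)
```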


## 3.3 - Combining data from multiple datasets