## Using pandas to parse CSV files too large for Excel

This lesson uses the `pandas` library.

The file we’ll work with is a compilation of all the car accidents in England from 1979 to 2004. It's a 750 MB CSV file.

Start by unzipping the file, and then try to open `Accidents7904.csv` in Excel. **Be careful! If you don’t have enough memory, this could very well crash your computer.**

What happens?

You should see a *File Not Loaded Completely* error, since an Excel worksheet can hold at most 1,048,576 rows. Luckily, Python can handle a file of this size, and you don't need a supercomputer to do it!

### Inspecting large files

If you want to inspect a large file on Windows without opening it in full, run this in a console: `more /E <path-to-your-file>`

On Linux, run this in a console: `head -n <number-of-lines> <path-to-your-file>`
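The same peek can be done from Python on any platform. The following is a minimal sketch, not part of the original lesson: `peek` is a hypothetical helper name, and it assumes the file is plain text that can be decoded as UTF-8 (undecodable bytes are replaced rather than raising an error).

```python
from itertools import islice


def peek(path, n=5, encoding="utf-8"):
    """Return the first n lines of a text file without reading the rest.

    `peek` is a hypothetical helper written for this illustration.
    """
    with open(path, encoding=encoding, errors="replace") as handle:
        # islice stops after n lines, so the remainder of the file is never read.
        return [line.rstrip("\n") for line in islice(handle, n)]
```

For example, `peek("Accidents7904.csv", 3)` would return the header line and the first two records as strings. If you'd rather get a DataFrame, `pd.read_csv(path, nrows=5)` reads only the first five rows.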

### Our goal

To extract all accidents that happened in London in the year 2000.

A common approach is to read the big file in chunks and store it in a SQLite table. This has several advantages: you can then use SQL queries to filter the data without re-reading the whole file every time, and you use much less memory in the process. It also keeps your machine from freezing while the file is processed, so you can keep working on other things in the meantime.

Now let’s build the script.

In [28]:
from timeit import default_timer as timer  # times the load, which may take a while depending on your machine
import pandas as pd
from sqlalchemy import create_engine

file = "Accidents7904.csv"  # the file you want to read
csv_database = create_engine('sqlite:///csv_database.db')  # engine for the SQLite database file
table = 'original_csv'  # the table's name
chunksize = 50000  # rows to parse per iteration; adjust to your machine's RAM
start = timer()

# With chunksize set, read_csv yields one DataFrame per chunk instead of loading the whole file.
for df in pd.read_csv(file, chunksize=chunksize, iterator=True, low_memory=False):
    # Strip spaces from column names so they are easier to use in SQL.
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    # The default integer index is written to the table as the `index` column.
    df.to_sql(table, csv_database, if_exists='append')

time_taken = timer() - start
print(f'The operation took {round(time_taken / 60, 2)} minutes.')

The operation took 5.47 minutes.


You can see in the output how much time it took to process the whole file.
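To see what the chunked loop does without waiting several minutes, here is a small-scale sketch of the same pattern. Everything in it is an assumption made for illustration: the toy CSV contents are made up, the table name `toy_table` is hypothetical, and it uses an in-memory database through the standard-library `sqlite3` module instead of the SQLAlchemy engine used in the lesson.

```python
import io
import sqlite3

import pandas as pd

# Toy data standing in for the real CSV (values are made up for illustration).
toy_csv = io.StringIO(
    "Accident_Index,Police Force,Date\n"
    "A1,1,01/01/2000\n"
    "A2,3,15/06/1999\n"
    "A3,1,31/12/2000\n"
    "A4,1,02/02/2001\n"
    "A5,4,09/09/2000\n"
)

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

chunk_sizes = []
for chunk in pd.read_csv(toy_csv, chunksize=2):
    # Same clean-up as in the lesson: remove spaces from column names.
    chunk = chunk.rename(columns={c: c.replace(" ", "") for c in chunk.columns})
    # The first call creates the table; later calls append to it.
    chunk.to_sql("toy_table", conn, if_exists="append")
    chunk_sizes.append(len(chunk))

total = int(pd.read_sql_query("SELECT count(*) AS n FROM toy_table", conn)["n"].iloc[0])
print(chunk_sizes)  # number of rows in each chunk
print(total)        # number of rows that ended up in the table
conn.close()
```

With five rows and a chunk size of two, the loop runs three times, and the table ends up holding all five rows. Only one chunk is in memory at any moment, which is what keeps the memory footprint small on the real file.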

Now we're ready to query our data. You can continue to use pandas by issuing SQL queries and saving the results in DataFrames. For example, let's see how many rows we have:

In [33]:
df = pd.read_sql_query(f'SELECT count(*) FROM {table}', csv_database)
df

Unnamed: 0,count(*)
0,6224198


Or simply get a glimpse of our data:

In [34]:
import pandas as pd
from sqlalchemy import create_engine

csv_database = create_engine('sqlite:///csv_database.db') # the name of the database
table = 'original_csv'

df = pd.read_sql_query(f'SELECT * FROM {table} LIMIT 10', csv_database)
df

Unnamed: 0,index,Accident_Index,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,...,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Carriageway_Hazards,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,LSOA_of_Accident_Location
0,0,197901A11AD14,,,,,1,3,2,1,...,-1,-1,1,8,1,-1,0,-1,-1,
1,1,197901A1BAW34,198460.0,894000.0,,,1,3,1,1,...,-1,-1,4,8,3,-1,0,-1,-1,
2,2,197901A1BFD77,406380.0,307000.0,,,1,3,2,3,...,-1,-1,4,8,3,-1,0,-1,-1,
3,3,197901A1BGC20,281680.0,440000.0,,,1,3,2,2,...,-1,-1,4,8,3,-1,0,-1,-1,
4,4,197901A1BGF95,153960.0,795000.0,,,1,2,2,1,...,-1,-1,4,3,3,-1,0,-1,-1,
5,5,197901A1CBC96,300370.0,146000.0,,,1,3,1,1,...,-1,-1,4,8,3,-1,0,-1,-1,
6,6,197901A1DAK71,143370.0,951000.0,,,1,3,2,2,...,-1,-1,4,8,3,-1,0,-1,-1,
7,7,197901A1DAP95,471960.0,845000.0,,,1,3,2,1,...,-1,-1,4,8,3,-1,0,-1,-1,
8,8,197901A1EAC32,323880.0,632000.0,,,1,2,1,1,...,-1,-1,4,3,3,-1,0,-1,-1,
9,9,197901A1FBK75,136380.0,245000.0,,,1,3,2,1,...,-1,-1,4,8,3,-1,0,-1,-1,


To meet our goal, we need to apply two filters to our query: we only want accidents that happened in the year 2000 and that occurred in London.

For the first, we'll filter the `Date` column and retrieve the rows whose value ends in '2000'. For the second, we'll filter the `Police_Force` column and retrieve only the rows where it equals 1 (which is the code for *London Metropolitan Police*).

In [40]:
# Single quotes mark the string literal in SQL; '%2000' matches dates that end in 2000.
df = pd.read_sql_query(f"SELECT * FROM {table} WHERE Police_Force = 1 AND `Date` LIKE '%2000'", csv_database)
print(f'Total number of accidents in London in the year 2000: {len(df)}')
df.head(10)

Total number of accidents in London in the year 2000: 37708


Unnamed: 0,index,Accident_Index,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,...,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Carriageway_Hazards,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,LSOA_of_Accident_Location
0,5118264,2000010SU0982,522270.0,200330.0,-0.232572,51.688371,1,3,4,3,...,0,0,1,1,1,0,0,2,1,E01023584
1,5118265,2000010SU0983,536010.0,204970.0,-0.03211,51.726908,1,3,2,1,...,0,0,1,1,1,0,0,2,1,E01023310
2,5118266,2000010SU0984,519480.0,204200.0,-0.271588,51.723752,1,3,4,1,...,0,0,1,1,1,0,0,2,1,E01023584
3,5118267,2000010SU0985,520760.0,202280.0,-0.253731,51.706222,1,3,3,1,...,0,0,1,1,1,0,0,2,1,E01023584
4,5118268,2000010SU0986,523250.0,199890.0,-0.218557,51.684203,1,3,3,1,...,0,0,1,1,1,0,0,2,1,E01023584
5,5118269,2000010SU0987,521490.0,201610.0,-0.243404,51.700044,1,3,2,1,...,0,0,1,1,1,0,0,2,1,E01023584
6,5118270,2000010SU0988,521680.0,201190.0,-0.240803,51.696228,1,3,4,2,...,0,0,1,1,1,0,0,2,1,E01023584
7,5118271,2000010SU0989,511930.0,197690.0,-0.382936,51.666798,1,3,3,1,...,0,0,1,1,1,0,0,2,1,E01023552
8,5118272,2000010SU0990,512140.0,195650.0,-0.380554,51.648421,1,3,1,1,...,0,0,1,1,2,0,0,1,1,E01023553
9,5118273,2000010SU0991,511840.0,197370.0,-0.384339,51.66394,1,3,4,1,...,0,0,4,5,2,0,0,2,1,E01023554
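The police force code and the year in the query above are hard-coded. If you want to reuse the query for other values, one option is to pass them as parameters instead of building them into the SQL string. The sketch below is an illustration under stated assumptions: the toy rows and the `toy_table` name are made up, and it uses a `sqlite3` connection, whose placeholder style is `?`. With a SQLAlchemy engine the placeholder syntax may differ.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Made-up rows that mimic the three columns the filter needs.
pd.DataFrame(
    {
        "Accident_Index": ["A1", "A2", "A3", "A4"],
        "Police_Force": [1, 3, 1, 1],
        "Date": ["01/01/2000", "15/06/2000", "31/12/2000", "02/02/2001"],
    }
).to_sql("toy_table", conn, index=False)

# The values are supplied separately through `params`, one per `?` placeholder.
query = 'SELECT * FROM toy_table WHERE Police_Force = ? AND "Date" LIKE ?'
london_2000 = pd.read_sql_query(query, conn, params=(1, "%2000"))
print(london_2000["Accident_Index"].tolist())
conn.close()
```

Only the rows with police force 1 and a date ending in 2000 come back. Changing the year then means changing one value in `params` rather than editing the SQL text.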


### The final script

In [None]:
import pandas as pd
from sqlalchemy import create_engine

file = "Accidents7904.csv"  # the file you want to read
csv_database = create_engine('sqlite:///csv_database.db')  # engine for the SQLite database file
table = 'original_csv'  # the table's name
chunksize = 50000  # rows to parse per iteration; adjust to your machine's RAM

# Load the CSV into SQLite one chunk at a time.
for df in pd.read_csv(file, chunksize=chunksize, iterator=True, low_memory=False):
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})  # strip spaces from column names
    df.to_sql(table, csv_database, if_exists='append')

# Keep only accidents handled by police force 1 (London) whose date ends in 2000.
df = pd.read_sql_query(f"SELECT * FROM {table} WHERE Police_Force = 1 AND `Date` LIKE '%2000'", csv_database)
print(f'Total number of accidents in London in the year 2000: {len(df)}')
df.head(10)
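One thing to watch for: because the loop uses `if_exists='append'`, running the script a second time adds every row to the table again. A possible way around this, sketched below, is to replace the table on the first chunk and append on the rest. `load_csv_to_sqlite` is a hypothetical helper name, and the sketch assumes a `sqlite3` connection rather than the SQLAlchemy engine used above.

```python
import sqlite3

import pandas as pd


def load_csv_to_sqlite(csv_path, conn, table, chunksize=50000):
    """Load a CSV into `table` chunk by chunk, replacing any previous contents.

    Hypothetical helper for illustration; returns the number of rows written.
    """
    total_rows = 0
    for i, chunk in enumerate(pd.read_csv(csv_path, chunksize=chunksize)):
        # Same clean-up as in the lesson: remove spaces from column names.
        chunk = chunk.rename(columns={c: c.replace(" ", "") for c in chunk.columns})
        # The first chunk replaces the table; later chunks append to it.
        mode = "replace" if i == 0 else "append"
        chunk.to_sql(table, conn, if_exists=mode)
        total_rows += len(chunk)
    return total_rows
```

A call might look like `load_csv_to_sqlite("Accidents7904.csv", sqlite3.connect("csv_database.db"), "original_csv")`. The trade-off is that every run reloads the whole file, so this suits cases where the CSV may have changed between runs.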