# Working with Pandas

![](images/python_foundation/pandas-logo.png)

Pandas is a powerful library for working with data. Pandas provides fast and easy functions for reading data from files, and analyzing it.

Pandas is based on another library called `numpy` - which is widely used in scientific computing. Pandas extends `numpy` and provides new data types such as **Index**, **Series** and **DataFrames**.
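
As a rough sketch of how these three types relate, the snippet below builds a small Series and DataFrame by hand. The names `population` and `cities` are chosen here for illustration only; the values are taken from the cities table shown later in this section.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional array of values with a label for each value
population = pd.Series([35676000.0, 18978000.0], index=['Tokyo', 'Mumbai'])

# A DataFrame is a table; each of its columns is a Series sharing one Index
cities = pd.DataFrame(
    {'country': ['Japan', 'India'],
     'population': [35676000.0, 18978000.0]},
    index=['Tokyo', 'Mumbai'],
)

print(type(cities.index))             # the row labels are stored in an Index
print(type(cities['population']))     # selecting one column gives a Series
print(type(population.to_numpy()))    # the underlying values are a numpy array
```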

The Pandas implementation is fast and efficient, so compared to other methods of data processing, using `pandas` results in simpler code and quicker processing. We will now re-implement our code for reading a file and computing distance using Pandas.

By convention, `pandas` is commonly imported as `pd`.

In [4]:
import pandas as pd

## Reading Files

In [5]:
import os
data_pkg_path = 'data'
filename = 'worldcities.csv'
path = os.path.join(data_pkg_path, filename)

A **DataFrame** is the most used Pandas object. You can think of a DataFrame as being equivalent to a Spreadsheet or an Attribute Table of a GIS layer.

Pandas provides easy methods to directly read files into a DataFrame. You can use methods such as `read_csv()`, `read_excel()`, `read_hdf()` and so forth to read a variety of formats. Here we will read the `worldcities.csv` file using the `read_csv()` method.

In [6]:
df = pd.read_csv(path)

In [9]:
df

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
0,Tokyo,Tokyo,35.6850,139.7514,Japan,JP,JPN,Tōkyō,primary,35676000.0,1392685764
1,New York,New York,40.6943,-73.9249,United States,US,USA,New York,,19354922.0,1840034016
2,Mexico City,Mexico City,19.4424,-99.1310,Mexico,MX,MEX,Ciudad de México,primary,19028000.0,1484247881
3,Mumbai,Mumbai,19.0170,72.8570,India,IN,IND,Mahārāshtra,admin,18978000.0,1356226629
4,São Paulo,Sao Paulo,-23.5587,-46.6250,Brazil,BR,BRA,São Paulo,admin,18845000.0,1076532519
...,...,...,...,...,...,...,...,...,...,...,...
15488,Timmiarmiut,Timmiarmiut,62.5333,-42.2167,Greenland,GL,GRL,Kujalleq,,10.0,1304206491
15489,Cheremoshna,Cheremoshna,51.3894,30.0989,Ukraine,UA,UKR,Kyyivs’ka Oblast’,,0.0,1804043438
15490,Ambarchik,Ambarchik,69.6510,162.3336,Russia,RU,RUS,Sakha (Yakutiya),,0.0,1643739159
15491,Nordvik,Nordvik,74.0165,111.5100,Russia,RU,RUS,Krasnoyarskiy Kray,,0.0,1643587468


Once the file is read and a DataFrame object is created, we can inspect it using the `head()` method. 

In [None]:
print(df.head())

There is also an `info()` method that shows basic information about the dataframe, such as the number of rows/columns and the data type of each column.

In [None]:
# info() prints its summary directly and returns None, so print() is not needed
df.info()
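
A few other inspection tools are worth knowing alongside `head()` and `info()`. The sketch below is self-contained: it reads a handful of rows (copied from the table above) from an in-memory string rather than from `worldcities.csv`, and the names `csv_text` and `sample_df` are used for this illustration only.

```python
import io
import pandas as pd

# A few rows copied from the table above, held in memory instead of a file
csv_text = """city_ascii,lat,lng,country,population
Tokyo,35.6850,139.7514,Japan,35676000.0
New York,40.6943,-73.9249,United States,19354922.0
Mumbai,19.0170,72.8570,India,18978000.0
"""

# read_csv() accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)      # (number of rows, number of columns)
print(sample_df.dtypes)     # data type of each column
print(sample_df.head(2))    # first 2 rows instead of the default 5
sample_df.info()            # prints a summary and returns None
```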

## Filtering Data

Pandas has many ways of selecting and filtering data from a dataframe. We will now see how to use [Boolean Filtering](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing) to filter the dataframe to rows that match a condition.

In [10]:
home_country = 'India'
filtered = df[df['country'] == home_country]
filtered

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
3,Mumbai,Mumbai,19.0170,72.8570,India,IN,IND,Mahārāshtra,admin,18978000.0,1356226629
5,Delhi,Delhi,28.6700,77.2300,India,IN,IND,Delhi,admin,15926000.0,1356872604
7,Kolkata,Kolkata,22.4950,88.3247,India,IN,IND,West Bengal,admin,14787000.0,1356060520
34,Chennai,Chennai,13.0900,80.2800,India,IN,IND,Tamil Nādu,admin,7163000.0,1356374944
36,Bengalūru,Bengaluru,12.9700,77.5600,India,IN,IND,Karnātaka,admin,6787000.0,1356410365
...,...,...,...,...,...,...,...,...,...,...,...
7305,Karūr,Karur,10.9504,78.0833,India,IN,IND,Tamil Nādu,,76915.0,1356837900
7441,Jorhāt,Jorhat,26.7500,94.2167,India,IN,IND,Assam,,69033.0,1356638741
7583,Sopur,Sopur,34.3000,74.4667,India,IN,IND,Jammu and Kashmīr,,63035.0,1356978065
7681,Tezpur,Tezpur,26.6338,92.8000,India,IN,IND,Assam,,58851.0,1356299437


In [11]:
filtered = df[df['population'] > 1000000]
filtered

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
0,Tokyo,Tokyo,35.6850,139.7514,Japan,JP,JPN,Tōkyō,primary,35676000.0,1392685764
1,New York,New York,40.6943,-73.9249,United States,US,USA,New York,,19354922.0,1840034016
2,Mexico City,Mexico City,19.4424,-99.1310,Mexico,MX,MEX,Ciudad de México,primary,19028000.0,1484247881
3,Mumbai,Mumbai,19.0170,72.8570,India,IN,IND,Mahārāshtra,admin,18978000.0,1356226629
4,São Paulo,Sao Paulo,-23.5587,-46.6250,Brazil,BR,BRA,São Paulo,admin,18845000.0,1076532519
...,...,...,...,...,...,...,...,...,...,...,...
493,Rotterdam,Rotterdam,51.9200,4.4800,Netherlands,NL,NLD,Zuid-Holland,minor,1005000.0,1528892850
494,Homs,Homs,34.7300,36.7200,Syria,SY,SYR,Ḩimş,admin,1005000.0,1760013934
495,Cologne,Cologne,50.9300,6.9500,Germany,DE,DEU,North Rhine-Westphalia,,1004000.0,1276015998
496,Qinhuangdao,Qinhuangdao,39.9304,119.6200,China,CN,CHN,Hebei,,1003000.0,1156091093
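
The two filters above can also be combined into one. The following is a minimal sketch on a small hand-built dataframe (`sample_df` is a stand-in for `df`, using a few rows from the tables above) showing how boolean conditions are joined with `&`, `|` and `~`.

```python
import pandas as pd

sample_df = pd.DataFrame({
    'city_ascii': ['Tokyo', 'Mumbai', 'Delhi', 'Karur'],
    'country': ['Japan', 'India', 'India', 'India'],
    'population': [35676000.0, 18978000.0, 15926000.0, 76915.0],
})

# Each comparison yields a boolean Series with one True/False per row
is_india = sample_df['country'] == 'India'
is_large = sample_df['population'] > 1000000

# Combine with & (and), | (or) and ~ (not) rather than Python's and/or/not
large_indian = sample_df[is_india & is_large]
not_india = sample_df[~is_india]

print(large_indian['city_ascii'].tolist())
print(not_india['city_ascii'].tolist())
```

When the comparisons are written inline, each one needs its own parentheses, for example `sample_df[(sample_df['country'] == 'India') & (sample_df['population'] > 1000000)]`.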


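A filtered dataframe should be treated as a read-only selection of the original data: Pandas does not guarantee whether it is a view or a copy, and modifying it can raise a `SettingWithCopyWarning`. To get an independent dataframe that is safe to change, save the filtered result using the `copy()` method.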

In [12]:
country_df = df[df['country'] == home_country].copy()
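
To illustrate why the copy is useful, here is a small sketch (with a hand-built `sample_df` standing in for `df`, and a hypothetical `visited` column added purely for demonstration) showing that changes made to the copy leave the original untouched.

```python
import pandas as pd

sample_df = pd.DataFrame({
    'city_ascii': ['Tokyo', 'Mumbai', 'Delhi'],
    'country': ['Japan', 'India', 'India'],
})

# copy() returns an independent dataframe that is safe to modify
india_df = sample_df[sample_df['country'] == 'India'].copy()
india_df['visited'] = True   # hypothetical column, for illustration only

print(india_df.columns.tolist())
print(sample_df.columns.tolist())   # the original has no 'visited' column
```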

In [43]:
home_city = 'Mumbai'
# Take the first matching row, then read its longitude (a single number)
home_city_lng = df[df['city'] == home_city].iloc[0]['lng']
home_city_lng

72.857

In [None]:
# iloc[]

In [36]:
home_city = 'Bengaluru'

country_df[country_df['city_ascii'] == home_city].iloc[0]['lng']

77.56

To locate a particular row or column in a dataframe, Pandas provides the `loc[]` and `iloc[]` indexers, which allow you to *locate* particular slices of data. Learn about [different indexing methods](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#different-choices-for-indexing) in Pandas. Here we first filter to the rows matching the `home_city` name and then use `iloc[]` to pick a row from the result. Since `iloc[]` selects by integer position, the *0* here refers to the first row.

In [40]:
home_city = 'Bengaluru'
filtered = country_df[country_df['city_ascii'] == home_city]
print(filtered.iloc[0])

city           Bengalūru
city_ascii     Bengaluru
lat                12.97
lng                77.56
country            India
iso2                  IN
iso3                 IND
admin_name     Karnātaka
capital            admin
population     6787000.0
id            1356410365
Name: 36, dtype: object
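
Note the `Name: 36` in the output: the row keeps its original index label even though it was fetched by position 0. The sketch below, on a hand-built `sample_df` whose index labels mirror those in the filtered output, contrasts position-based `iloc[]` with label-based `loc[]`.

```python
import pandas as pd

# Index labels 3, 5 and 36 mirror the row labels in the filtered output above
sample_df = pd.DataFrame(
    {'city_ascii': ['Mumbai', 'Delhi', 'Bengaluru'],
     'lng': [72.857, 77.23, 77.56]},
    index=[3, 5, 36],
)

by_position = sample_df.iloc[2]   # third row, whatever its label is
by_label = sample_df.loc[36]      # the row whose index label is 36

print(by_position['city_ascii'], by_label['city_ascii'])

# loc[] also accepts a row label and a column name together
print(sample_df.loc[36, 'lng'])
```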


Now that we have filtered down the data to a single row, we can select individual column values using column names.

In [38]:
home_city_coordinates = (filtered.iloc[0]['lat'], filtered.iloc[0]['lng'])
print(home_city_coordinates)

(12.97, 77.56)


## Performing calculations

Let's learn how to do calculations on a dataframe. We could iterate over each row and perform the calculation, but Pandas provides a more concise and efficient way. You can use the `apply()` method to run a function on each row. This makes it easy to run complex computations on large datasets.

The `apply()` method takes 2 main arguments: a function to apply, and the axis along which to apply it. With `axis=0` (the default), the function is called once per column and receives that column; with `axis=1`, it is called once per row and receives that row.

![](images/python_foundation/pandas_axis.png)
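
To make the difference between the two axes concrete, here is a small sketch on a hand-built `coords` dataframe (coordinates taken from the tables above). The lambda functions are illustrative only.

```python
import pandas as pd

coords = pd.DataFrame({
    'lat': [19.017, 28.67, 12.97],
    'lng': [72.857, 77.23, 77.56],
})

# axis=0: the function receives each column as a Series,
# so the result has one value per column
column_max = coords.apply(lambda column: column.max(), axis=0)

# axis=1: the function receives each row as a Series,
# so the result has one value per row
row_labels = coords.apply(
    lambda row: '{:.2f}, {:.2f}'.format(row['lat'], row['lng']), axis=1)

print(column_max)
print(row_labels)
```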

In [None]:
from geopy import distance

def calculate_distance(row):
    city_coordinates = (row['lat'], row['lng'])
    return distance.geodesic(city_coordinates, home_city_coordinates).km

result = country_df.apply(calculate_distance, axis=1)
print(result)

We can add these results to the dataframe by simply assigning the result to a new column.

In [None]:
country_df['distance'] = result
print(country_df)

We are done with our analysis and ready to save the results. We can further filter the results to only certain columns.

In [None]:
filtered = country_df[['city_ascii','distance']]
print(filtered)

Let's rename the `city_ascii` column to give it a more readable name.

In [None]:
filtered = filtered.rename(columns = {'city_ascii': 'city'})
print(filtered)

Now that we have filtered the original data and computed the distance for all cities, we can save the resulting dataframe to a file. Similar to the read methods, Pandas has several write methods, such as `to_csv()`, `to_excel()` etc.

Here we will use the `to_csv()` method to write a CSV file. Pandas assigns an index column (unique integer values) to a dataframe by default. We specify `index=False` so that this index is not added to our output.

In [None]:
output_filename = 'cities_distance_pandas.csv'
output_dir = 'output'
output_path = os.path.join(output_dir, output_filename)
# Create the output directory if it does not already exist
os.makedirs(output_dir, exist_ok=True)
filtered.to_csv(output_path, index=False)
print('Successfully written output file at {}'.format(output_path))
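
To see what `index=False` changes, the sketch below writes a small hand-built dataframe to a string instead of a file. When no path is given, `to_csv()` returns the CSV text. The `sample_df` name and its contents are for illustration only.

```python
import pandas as pd

sample_df = pd.DataFrame(
    {'city': ['Mumbai', 'Delhi'], 'lng': [72.857, 77.23]},
    index=[3, 5],
)

# With no path argument, to_csv() returns the CSV text as a string
with_index = sample_df.to_csv()
without_index = sample_df.to_csv(index=False)

print(with_index)      # first column holds the index labels, with no header name
print(without_index)   # only the 'city' and 'lng' columns are written
```

An index written this way comes back as an `Unnamed: 0` column when the file is read again with `read_csv()`, which is another reason to leave it out unless it carries meaning.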

## Exercise

You will notice that the output file contains a row with the `home_city` as well. Modify the `filtered` dataframe to remove this row and write it to a file.

Hint: Use the Boolean filtering method we learnt earlier to select rows that do not match the `home_city`.

----