# Working with pandas

Pandas is a powerful library for working with tabular data. It provides fast, easy-to-use functions for reading data from files and analyzing it.

Pandas is built on another library called `numpy`, which is widely used in scientific computing. Pandas extends `numpy` and provides new data types such as **Index**, **Series** and **DataFrame**.
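
To make these three types concrete, here is a minimal sketch. The variable names `populations` and `cities` are illustrative and not part of the tutorial's dataset code; the values are taken from the first rows of the cities file shown later.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional array of values with a label for each entry.
populations = pd.Series([35676000, 19354922], index=['Tokyo', 'New York'])

# A DataFrame is a table whose columns are Series that share one Index.
cities = pd.DataFrame({
    'city': ['Tokyo', 'New York'],
    'lat': [35.685, 40.6943],
})

print(type(cities.index))    # an Index subclass (RangeIndex by default)
print(type(cities['lat']))   # selecting one column gives a Series
print(isinstance(cities['lat'].to_numpy(), np.ndarray))  # numpy underneath
```

Label-based lookup such as `populations['Tokyo']` is what the Index adds on top of a plain `numpy` array.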

The Pandas implementation is fast and efficient, so compared to other methods of data processing, using `pandas` results in simpler code and quicker processing. We will now re-implement our code for reading a file and computing distances using Pandas.

By convention, `pandas` is imported as `pd`.

In [3]:
import pandas as pd

In [10]:
import os
home_dir = os.path.expanduser('~')
data_pkg_path = 'Downloads/python_foundation/'
filename = 'worldcities.csv'
path = os.path.join(home_dir, data_pkg_path, filename)

A **DataFrame** is the most commonly used Pandas object. You can think of a DataFrame as the equivalent of a spreadsheet or the attribute table of a GIS layer.

Pandas provides easy methods to read files directly into a DataFrame. You can use functions such as `read_csv()`, `read_excel()`, `read_hdf()` and so forth to read a variety of formats. Here we will read the `worldcities.csv` file using the `read_csv()` function.

In [11]:
df = pd.read_csv(path)
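
If you do not have the data package at hand, the same call can be tried on a small in-memory CSV. This is a sketch under that assumption: `csv_text` and `sample_df` are illustrative names, and the two rows are copied from the cities file.

```python
import io
import pandas as pd

# A tiny CSV held in a string; io.StringIO makes it look like a file.
csv_text = """city,lat,lng,country
Tokyo,35.685,139.7514,Japan
Mumbai,19.017,72.857,India
"""

sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (2, 4): two rows, four columns
print(sample_df.dtypes)  # lat and lng are parsed as float64
```

`read_csv()` takes the column names from the first line and infers a data type for each column.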

Once the file is read and a DataFrame object is created, we can inspect it using the `head()` method, which returns the first 5 rows by default. Note that we are not using `print()` here. Jupyter notebooks implicitly call `display()` on the last expression in a cell and give us nicely formatted output. This is very useful when dealing with DataFrames.

In [12]:
df.head()

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
0,Tokyo,Tokyo,35.685,139.7514,Japan,JP,JPN,Tōkyō,primary,35676000.0,1392685764
1,New York,New York,40.6943,-73.9249,United States,US,USA,New York,,19354922.0,1840034016
2,Mexico City,Mexico City,19.4424,-99.131,Mexico,MX,MEX,Ciudad de México,primary,19028000.0,1484247881
3,Mumbai,Mumbai,19.017,72.857,India,IN,IND,Mahārāshtra,admin,18978000.0,1356226629
4,São Paulo,Sao Paulo,-23.5587,-46.625,Brazil,BR,BRA,São Paulo,admin,18845000.0,1076532519
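
`head()` also accepts a row count, and `tail()` is its counterpart for the end of the table. A minimal sketch on a made-up ten-row frame (`sample_df` and its `rank` column are illustrative, not part of the dataset):

```python
import pandas as pd

sample_df = pd.DataFrame({'rank': range(1, 11)})

first_five = sample_df.head()    # default: first 5 rows
first_three = sample_df.head(3)  # first 3 rows
last_two = sample_df.tail(2)     # last 2 rows

print(len(first_five), len(first_three), len(last_two))
```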


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15493 entries, 0 to 15492
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   city        15493 non-null  object 
 1   city_ascii  15493 non-null  object 
 2   lat         15493 non-null  float64
 3   lng         15493 non-null  float64
 4   country     15493 non-null  object 
 5   iso2        15462 non-null  object 
 6   iso3        15493 non-null  object 
 7   admin_name  15302 non-null  object 
 8   capital     5246 non-null   object 
 9   population  13808 non-null  float64
 10  id          15493 non-null  int64  
dtypes: float64(3), int64(1), object(7)
memory usage: 1.3+ MB
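
The non-null counts above show that `iso2`, `admin_name`, `capital` and `population` have fewer than 15493 values, so some entries are missing. One way to count missing values per column is `isna().sum()`. The sketch below uses a made-up three-row frame (`sample_df` and its placeholder rows are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Placeholder rows with two deliberately missing values.
sample_df = pd.DataFrame({
    'city': ['City A', 'City B', 'City C'],
    'capital': ['primary', np.nan, 'admin'],
    'population': [1000.0, 2000.0, np.nan],
})

# isna() marks missing cells as True; sum() counts the True values per column.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```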


In [8]:
df.describe()

Unnamed: 0,lat,lng,population,id
count,15493.0,15493.0,13808.0,15493.0
mean,29.633315,-29.834189,181248.0,1623208000.0
std,22.414727,76.340457,794798.9,282645100.0
min,-54.9333,-179.59,0.0,1004003000.0
25%,22.305,-86.3242,9167.5,1404601000.0
50%,37.7562,-71.9167,23496.5,1826644000.0
75%,42.4442,25.5821,90306.25,1840015000.0
max,82.4833,179.3833,35676000.0,1934000000.0
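
In the summary above, the `count` for `population` (13808) is lower than for the other columns (15493) because `describe()` skips missing values. A small sketch of that behavior on a made-up Series (`values` and `stats` are illustrative names):

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, np.nan, 4.0])

# describe() computes its statistics over the non-missing entries only.
stats = values.describe()
print(stats['count'])  # 3.0, not 4.0
print(stats['max'])    # 4.0
```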


In [33]:
home_country = 'India'
# Keep only the rows for the home country; copy() returns an independent DataFrame
country_df = df[df['country'] == home_country].copy()
country_df.shape

(212, 12)
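
The filter above works in two steps: the comparison produces a boolean Series, and indexing the DataFrame with it keeps the rows marked `True`. A minimal sketch on a made-up three-row frame (`sample_df`, `mask`, `subset` and the `is_home_country` column are illustrative names):

```python
import pandas as pd

sample_df = pd.DataFrame({
    'city_ascii': ['Tokyo', 'Mumbai', 'Delhi'],
    'country': ['Japan', 'India', 'India'],
})

# Step 1: an element-wise comparison gives one True/False per row.
mask = sample_df['country'] == 'India'

# Step 2: indexing with the mask keeps the True rows; copy() makes the
# result independent, so adding a column does not touch sample_df.
subset = sample_df[mask].copy()
subset['is_home_country'] = True

print(mask.tolist())  # [False, True, True]
print(subset.shape)   # (2, 3)
```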

In [41]:
home_city = 'Bengaluru'
# Look up the home city by its ASCII name and take the first matching row
filtered = df[df['city_ascii'] == home_city]
home_city_coordinates = (filtered.iloc[0]['lat'], filtered.iloc[0]['lng'])

home_city_coordinates

(12.97, 77.56)
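
Note that `iloc[0]` raises an `IndexError` when the filter matches no rows, for example if the city name is misspelled. A hedged sketch of a lookup that checks for this first; `find_coordinates` is a hypothetical helper, not part of the original code, and `sample_df` is a made-up two-row frame:

```python
import pandas as pd

sample_df = pd.DataFrame({
    'city_ascii': ['Mumbai', 'Bengaluru'],
    'lat': [19.017, 12.97],
    'lng': [72.857, 77.56],
})

def find_coordinates(cities, name):
    """Return (lat, lng) of the first city matching name, or None."""
    matches = cities[cities['city_ascii'] == name]
    if matches.empty:
        return None
    first = matches.iloc[0]  # first matching row, by position
    return (float(first['lat']), float(first['lng']))

print(find_coordinates(sample_df, 'Bengaluru'))  # (12.97, 77.56)
print(find_coordinates(sample_df, 'Atlantis'))   # None
```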

In [46]:
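from geopy import distance

def calculate_distance(row):
    # Geodesic distance in kilometers from this row's city to the home city
    city_coordinates = (row['lat'], row['lng'])
    return distance.geodesic(city_coordinates, home_city_coordinates).km

# axis=1 applies the function to each row rather than to each column
result = country_df.apply(calculate_distance, axis=1)
country_df['distance'] = result
country_df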

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id,distance
3,Mumbai,Mumbai,19.0170,72.8570,India,IN,IND,Mahārāshtra,admin,18978000.0,1356226629,837.185709
5,Delhi,Delhi,28.6700,77.2300,India,IN,IND,Delhi,admin,15926000.0,1356872604,1738.638856
7,Kolkata,Kolkata,22.4950,88.3247,India,IN,IND,West Bengal,admin,14787000.0,1356060520,1552.637823
34,Chennai,Chennai,13.0900,80.2800,India,IN,IND,Tamil Nādu,admin,7163000.0,1356374944,295.340107
36,Bengalūru,Bengaluru,12.9700,77.5600,India,IN,IND,Karnātaka,admin,6787000.0,1356410365,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...
7305,Karūr,Karur,10.9504,78.0833,India,IN,IND,Tamil Nādu,,76915.0,1356837900,230.567496
7441,Jorhāt,Jorhat,26.7500,94.2167,India,IN,IND,Assam,,69033.0,1356638741,2312.574457
7583,Sopur,Sopur,34.3000,74.4667,India,IN,IND,Jammu and Kashmīr,,63035.0,1356978065,2383.154991
7681,Tezpur,Tezpur,26.6338,92.8000,India,IN,IND,Assam,,58851.0,1356299437,2195.314732
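
The row-wise `apply()` pattern can be tried without installing `geopy`. The sketch below swaps in a haversine (great-circle) formula with an assumed mean Earth radius of 6371 km, so its results differ slightly from the geodesic distances in the table above. `haversine_km`, `home` and `sample_df` are illustrative names.

```python
import math
import pandas as pd

def haversine_km(origin, destination):
    """Great-circle distance in km between two (lat, lng) pairs in degrees."""
    lat1, lng1 = map(math.radians, origin)
    lat2, lng2 = map(math.radians, destination)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))  # assumed mean Earth radius

home = (12.97, 77.56)  # Bengaluru, as computed earlier
sample_df = pd.DataFrame({
    'city_ascii': ['Bengaluru', 'Chennai'],
    'lat': [12.97, 13.09],
    'lng': [77.56, 80.28],
})

# With axis=1 the function receives one row at a time as a Series.
sample_df['distance'] = sample_df.apply(
    lambda row: haversine_km((row['lat'], row['lng']), home), axis=1)
print(sample_df)
```

The Chennai value comes out close to, but not identical with, the geodesic figure, because the haversine formula treats the Earth as a sphere.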


In [47]:
output_filename = 'cities_distance_pandas.csv'
output_dir = 'Downloads'
output_path = os.path.join(home_dir, output_dir, output_filename)
# Write the DataFrame with the new distance column; index=False omits the row index
country_df.to_csv(output_path, index=False)
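
To see the effect of `index=False`, here is a small round-trip sketch that writes to a temporary directory instead of `Downloads`. The names `sample_df`, `out_path` and `sample_output.csv` are illustrative assumptions.

```python
import os
import tempfile
import pandas as pd

sample_df = pd.DataFrame({'city_ascii': ['Chennai'], 'distance': [295.340107]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'sample_output.csv')
    sample_df.to_csv(out_path, index=False)

    # The header has only the data columns; no extra index column is written.
    with open(out_path) as f:
        header = f.readline().strip()

    reloaded = pd.read_csv(out_path)

print(header)          # city_ascii,distance
print(reloaded.shape)  # (1, 2)
```

Without `index=False`, the file would gain a leading unnamed column holding the row index, which then shows up as `Unnamed: 0` when the file is read back.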