# Unleashing Pandas

## Importing a `csv` file

Let's play with the global land temperature dataset, which contains country-specific data from 1743 to 2013 and was downloaded from [https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data/version/2?select=GlobalLandTemperaturesByCountry.csv](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data/version/2?select=GlobalLandTemperaturesByCountry.csv). You can find a copy of the `csv` file in the `datasets` folder: [https://github.com/raghurama123/DataScience/tree/main/datasets](https://github.com/raghurama123/DataScience/tree/main/datasets).

In [None]:
import pandas as pd
temp=pd.read_csv('../datasets/GlobalLandTemperaturesByCountry.csv')
temp
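
Before loading the full file, it can help to see what `pd.read_csv` does on a tiny sample. The sketch below uses a made-up three-row table; the date column name `dt` and the country name `Xland` are assumptions for illustration, not values taken from the real file.

```python
import io
import pandas as pd

# Made-up sample; the column name 'dt' is an assumption about the file layout
sample = io.StringIO(
    "dt,AverageTemperature,AverageTemperatureUncertainty,Country\n"
    "1900-01-01,-2.5,0.4,Xland\n"
    "1900-02-01,,,Xland\n"
    "1900-03-01,3.1,0.3,Xland\n"
)

# parse_dates converts the listed column(s) from text to datetime values
toy = pd.read_csv(sample, parse_dates=["dt"])

print(toy.shape)   # (number of rows, number of columns)
print(toy.dtypes)  # data type of each column
print(toy)         # empty fields are read in as NaN
```

Checking `shape` and `dtypes` right after loading is a cheap way to confirm that the file was parsed as expected.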

## Remove missing entries

There are over half a million records, but some entries are missing and are denoted by `NaN`. Let's count these, drop the affected rows, and look at the remaining entries.

In [None]:
temp.isna().sum() # Number of missing (NaN) values in each column

In [None]:
temp=temp.dropna().reset_index(drop=True) # drop NaN entries and reassign row-indices
temp
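
Note that `isna().sum()` counts missing values per column, whereas `dropna()` removes every row that has at least one missing value. A minimal sketch on a made-up toy frame (the values and country labels are invented) shows the difference:

```python
import numpy as np
import pandas as pd

# Toy frame with invented values, using the column names from this notebook
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "AverageTemperature": [1.5, np.nan, 20.0, 21.0],
    "AverageTemperatureUncertainty": [0.3, np.nan, 0.2, np.nan],
})

missing_per_column = toy.isna().sum()             # one count per column
rows_with_missing = toy.isna().any(axis=1).sum()  # rows with at least one NaN
cleaned = toy.dropna().reset_index(drop=True)     # drop those rows, renumber

print(missing_per_column)
print(rows_with_missing, len(toy) - len(cleaned))
```

The two numbers in the last line agree, because the rows flagged by `isna().any(axis=1)` are exactly the rows that `dropna()` removes by default.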

## Quick inspection and pruning of the dataset

In [None]:
temp.describe()

It looks like some entries have large uncertainties: the maximum uncertainty is as large as 15 degrees Celsius. Let's keep only the data points whose uncertainty is at most 0.5 degrees Celsius.

In [None]:
temp=temp[temp['AverageTemperatureUncertainty'] <= 0.5]  # Retain data with uncertainty <= 0.5 degree Celsius
temp.describe()

In [None]:
temp

It looks like the filtering kept the original row indices instead of renumbering them. So, let's reset them.

In [None]:
temp=temp.reset_index(drop=True)
temp
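
Why the indices need resetting can be seen on a small example. The sketch below uses a made-up toy frame (invented values and country labels): boolean filtering keeps the original row labels, and `reset_index(drop=True)` renumbers them from 0.

```python
import pandas as pd

# Toy frame with invented values
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "AverageTemperatureUncertainty": [0.2, 1.7, 0.5, 3.0],
})

# Boolean mask: True for rows to keep
filtered = toy[toy["AverageTemperatureUncertainty"] <= 0.5]
print(filtered.index.tolist())    # labels of the surviving rows

# drop=True discards the old labels instead of storing them in a new column
renumbered = filtered.reset_index(drop=True)
print(renumbered.index.tolist())  # labels now run from 0
```

Both steps can also be chained in one line, as was done earlier with `dropna().reset_index(drop=True)`.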

In [None]:
temp.head()  # first few entries

In [None]:
temp.tail() # last few entries

## Now we can do some meaningful analysis

In [None]:
minval=temp['AverageTemperature'].min()
coolest=temp[temp['AverageTemperature'] == minval]
print(coolest)

In [None]:
tempGreenland=temp[temp["Country"] == "Greenland"]
tempGreenland['AverageTemperature'].plot.hist(bins=200, alpha=1.0)

In [None]:
maxval=temp['AverageTemperature'].max()
hottest=temp[temp['AverageTemperature'] == maxval]
print(hottest)

In [None]:
tempKuwait=temp[temp["Country"] == "Kuwait"]
tempKuwait['AverageTemperature'].plot.hist(bins=200, alpha=1.0)

You can also sort the entries by one column to quickly glance at the extreme data points.

In [None]:
temp.sort_values(by="AverageTemperature")
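
Sorting the whole table is not the only way to reach the extremes. As a rough sketch on a made-up toy frame (invented values and country labels), `idxmin` returns the row label of the minimum, `nlargest` returns the top rows directly, and `ascending=False` reverses the sort order:

```python
import pandas as pd

# Toy frame with invented values
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "AverageTemperature": [-30.5, 12.0, 38.1, 5.4],
})

# Row holding the lowest temperature
coldest_row = toy.loc[toy["AverageTemperature"].idxmin()]

# The two rows with the highest temperatures, hottest first
hottest_two = toy.nlargest(2, "AverageTemperature")

# Full sort from hottest to coldest
descending = toy.sort_values(by="AverageTemperature", ascending=False)

print(coldest_row)
print(hottest_two)
print(descending)
```

Unlike the `== minval` comparison used earlier, `idxmin` returns only the first matching row when several rows share the minimum.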