# 1. Pandas library 

To use pandas, first import the library. NumPy is imported alongside it under its usual alias.

In [None]:
import numpy as np
import pandas as pd

# 2. Pandas Data Structures

Pandas has three fundamental data structures: the **Series**, **DataFrame**, and **Index**.

## 2.1 Pandas Series

A pandas Series is a one-dimensional array of indexed data. It can be constructed with the *pd.Series* constructor (passing a list or array of values) or obtained from a DataFrame by extracting one of its columns. The constructor is used as follows:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

As we see in the output above, the series has both a sequence of **values** and a sequence of **indices**. We can access these with the *values* and *index* attributes.

In [None]:
data.values

In [None]:
data.index

As with a NumPy array, data can be accessed by the associated index:

In [None]:
data[2]
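
The index does not have to be the default integers. As a minimal sketch, assuming made-up values and string labels (the name `scores` is hypothetical), a Series can be given an explicit index and then read either by label or by position:

```python
import pandas as pd

# Hypothetical values and labels, chosen only for illustration
scores = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = scores['c']          # label-based access
by_position = scores.iloc[2]    # position-based access to the same element
subset = scores[['a', 'd']]     # a list of labels returns a smaller Series
```

Here `by_label` and `by_position` refer to the same element, because `'c'` is the third label.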

## 2.2 Pandas DataFrame

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet or an SQL table.

It is generally the most commonly used pandas object.
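
As a rough illustration, a small DataFrame can be built from a dictionary of columns. The column names and values below are invented for the example and do not come from the course data:

```python
import pandas as pd

# Hypothetical data, not taken from Cities.csv
frame = pd.DataFrame({
    'city': ['P', 'Q', 'R'],
    'temperature': [7.5, 1.75, 17.9],
    'coastal': [True, False, True],
})

column_types = frame.dtypes          # one dtype per column
temperatures = frame['temperature']  # a single column is a Series
```

Each column keeps its own type, which is what distinguishes a DataFrame from a plain two-dimensional NumPy array.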

## 2.3 Pandas Index
An Index can be viewed as an immutable array or as an ordered set. It holds the labels that are used to reference and modify data.
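
Both views can be tried out directly. The following sketch, using two made-up integer indexes, shows that item assignment is rejected and that set-style operations are available:

```python
import pandas as pd

idx_a = pd.Index([1, 3, 5, 7, 9])
idx_b = pd.Index([2, 3, 5, 7, 11])

# Index objects are immutable: item assignment raises TypeError
try:
    idx_a[0] = 100
    assignment_failed = False
except TypeError:
    assignment_failed = True

common = idx_a.intersection(idx_b)   # labels present in both
combined = idx_a.union(idx_b)        # labels present in either
```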

# Exercise 1: 

Create a new pandas Series containing any data.

# 3. Reading from a CSV file into a DataFrame

You can read data from a CSV file using the **read_csv** function. By default, it assumes that the fields are comma-separated.

In [None]:
cities = pd.read_csv('Data/Cities.csv')


In [None]:
# List all the columns in the DataFrame
cities.columns

In [None]:
# The len function shows how many rows there are in the DataFrame: 213
len(cities)

In [None]:
# How big is this dataframe (rows, columns)
cities.shape

In [None]:
# Let's view the first few rows
cities.head()

Notice that read_csv automatically treated the first row in the file as a header row.
We can override this default behavior by customizing some of the arguments, such as header, names, or index_col.
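
To see those arguments in action without touching the course files, the sketch below reads a tiny headerless CSV from an in-memory string. The CSV text and column names are assumptions made up for the example:

```python
import io
import pandas as pd

# Hypothetical CSV text with no header row, standing in for a file on disk
raw = io.StringIO("P,Northland,7.52\nQ,Southland,1.75\n")

cities_small = pd.read_csv(
    raw,
    header=None,                                # the first line is data, not a header
    names=['city', 'country', 'temperature'],   # supply the column names ourselves
    index_col='city',                           # use the city column as row labels
)
```

With `index_col='city'`, the city names become the row index, so a value can be looked up with `cities_small.loc['Q', 'temperature']`.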

In [None]:
# View Last 4 rows
cities.tail(4)

# Exercise 2: 

* Load the Titanic data as a pandas DataFrame and inspect the first 5 rows.
* How many rows does the data set contain?

We can also inspect the data type of each column. Some are integers, some are floats (can have a decimal part), and some are objects (text). If an identifying text variable has accidentally been imported as a float, for instance, that could cause problems down the road, so you should fix it before continuing.

In [None]:
cities.dtypes
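
One possible way to repair such a column is `astype`. The sketch below assumes a made-up frame in which an identifier (`ticket_id`, a hypothetical name) was read as a float and has no missing values:

```python
import pandas as pd

# Hypothetical frame where an identifier was read as a float
passengers = pd.DataFrame({'ticket_id': [1001.0, 1002.0, 1003.0],
                           'fare': [7.25, 71.28, 8.05]})

# Convert float -> int -> str so the identifier is treated as text
passengers['ticket_id'] = passengers['ticket_id'].astype(int).astype(str)
```

Going through `int` first removes the trailing `.0`. If the column contained missing values, the conversion to `int` would fail, so those would need to be handled beforehand.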

## 3.2 Adding and Dropping Columns
Let us add another column to the cities DataFrame. Suppose we want to add the temperature in Fahrenheit.

In [None]:
# Convert the temperature column to Fahrenheit and store it as a new column
cities['tempF'] = cities['temperature'] * 9 / 5 + 32

In [None]:
cities.head()

It is clear from the above result that we can perform arithmetic operations on the columns of a pandas DataFrame.

# <font color="red">Exercise</font>

Add a Fare column in Tsh to the Titanic DataFrame. Hint: 1 USD = 2000 Tsh.

In [None]:
#

### Dropping Column

We can delete a column from a pandas DataFrame. Let us delete the tempF column from the cities DataFrame.

In [None]:
cities.drop('tempF', axis=1, inplace=True)
cities.head()

### Note:
 1. **axis=1** denotes that we are referring to a column, not a row.
 2. **inplace=True** means that the change is applied directly to the DataFrame instead of being returned as a modified copy.
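
The effect of leaving out `inplace=True` can be checked on a small made-up frame (the data below is hypothetical). The sketch also uses the `columns=` keyword, which is an alternative to `axis=1`:

```python
import pandas as pd

# Hypothetical frame used only for illustration
df = pd.DataFrame({'city': ['P', 'Q'],
                   'temperature': [7.5, 1.75],
                   'tempF': [45.5, 35.15]})

# Without inplace=True, drop returns a new DataFrame and leaves df untouched
trimmed = df.drop('tempF', axis=1)

# columns= is an equivalent way of saying "these labels are columns"
also_trimmed = df.drop(columns=['tempF'])
```

After this runs, `df` still has three columns, while `trimmed` and `also_trimmed` have two.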

# <font color="red">Exercise</font>
* Drop the Fare in Tsh column you created in the previous exercise.
* Also delete the ticket and cabin columns.

**Hint**: To delete multiple columns use *dataframe.drop(['Column_name1', 'Column_name2'], axis=1)*.

In [None]:
#

## 3.3  Slicing Subsets of Rows and Columns in Python

#### Selecting a single column - returns a 'series'

In [None]:
cities.city

In [None]:
# Also try:
# cities['temperature']

#### Selecting multiple columns - returns a dataframe


In [None]:
#cities[['city','temperature']]


#### Selecting rows by number

In [None]:
cities[15:20]


In [None]:
# Try  cities[:8] and cities[200:]

## 3.4 Label and Position Based Selection of Columns and Rows
Pandas supports both label-based and position-based indexing, implemented with loc and iloc:
>**.loc** for label-based indexing

>**.iloc** for positional indexing
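
The distinction is easiest to see when the row labels are not integers. This sketch assumes a made-up frame with string labels:

```python
import pandas as pd

# Hypothetical frame with string row labels
df = pd.DataFrame({'temperature': [7.5, 1.75, 17.9, 12.6]},
                  index=['a', 'b', 'c', 'd'])

by_label = df.loc['b':'c']    # label slice: both endpoints are included
by_position = df.iloc[1:3]    # positional slice: the end position is excluded
```

Both expressions return the rows labeled `'b'` and `'c'`, but they get there by different rules.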

#### To slice a specific column using label indexing

In [None]:
# And here is how to slice a column:
cities.loc[: , "temperature"]

We can also use position indexing:

In [None]:
cities.iloc[:,4] 

####  To extract only a row you would do the inverse:

In [None]:
cities.iloc[4,:]

#### To select a range of rows and columns

In [None]:
## Select the first three rows and the first two columns (the end of each iloc slice is excluded)
cities.iloc[0:3,0:2]

#### Select only the specified range of columns

In [None]:
cities.iloc[:,0:2]

#### To select  different columns 

In [None]:
cities.iloc[:,[1, 4]]

# <font color="red">Exercise</font>

* What happens when you type the code below?
> ```python
     cities.loc[[0, 10, 200], :]
  ```

* What happens when you type
> ```python
    cities.iloc[0:4, 1:4]
    cities.loc[0:4, 1:4]
  ```
 How are the two commands different?
 
 
* What happens when you type:
> 
```python
          cities[0:3]
          cities[:5]
          cities[-1:] 
```          
          

## 3.5 Subsetting Data Using Criteria
We can also select a subset of our data using criteria. For example, we can select all rows that have a temperature higher than 15.

In [None]:
cities[cities.temperature > 15]

Or we can select all rows for cities that are in France:

In [None]:
cities[cities.country == 'France']

We can also use string operations. For example, select the rows for countries with **"ia"** in their name.

In [None]:
cities[cities.country.str.contains('ia')]
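
Criteria can also be combined. The sketch below uses a small made-up frame (not the course data) to show `&`, `|`, `~`, and `isin`. Each comparison is wrapped in its own parentheses because these operators bind more tightly than `>` or `==`:

```python
import pandas as pd

# Hypothetical data, not taken from Cities.csv
df = pd.DataFrame({'city': ['P', 'Q', 'R', 'S'],
                   'country': ['Austria', 'France', 'Italy', 'France'],
                   'temperature': [6.9, 10.2, 11.5, 14.6]})

# & means "and", | means "or", ~ means "not"
warm_french = df[(df.country == 'France') & (df.temperature > 12)]
not_french = df[~(df.country == 'France')]
either = df[df.country.isin(['Austria', 'Italy']) | (df.temperature > 14)]
```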

# <font color="red">Exercise</font>

Select all teenage female passengers in the Titanic dataset.

## 3.6  Sort Data in Pandas

We can also sort data in pandas. For example, let us sort the DataFrame's rows by latitude, in descending order.

In [None]:
cities.sort_values(by='latitude', ascending=False)

Sorting by country (ascending) and then by temperature (descending):

In [None]:
cities.sort_values(by=['country','temperature'],ascending=[True,False])

 **Putting it together**: Select City and longitude of all cities with latitude > 50 and temperature > 9, sorted by longitude

In [None]:
temp1 = cities[(cities.latitude > 50) & (cities.temperature > 9)]
temp2 = temp1[['city','longitude']]
temp2.sort_values(by='longitude')
# Show combining first two, then combining all (use \ for long lines)
# Note similar functionality to SQL

The above code is equivalent to:

In [None]:
cities[(cities.latitude > 50) & (cities.temperature > 9)][['city','longitude']].sort_values(by='longitude')
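
A long one-liner like this can be hard to read. One option, sketched here on a made-up frame with the same column names (`cities_demo` and its values are hypothetical), is to wrap the whole expression in parentheses so that each step sits on its own line without needing `\`:

```python
import pandas as pd

# Hypothetical data with the same column names as the cities DataFrame
cities_demo = pd.DataFrame({
    'city': ['P', 'Q', 'R', 'S'],
    'latitude': [57.0, 45.0, 52.0, 51.0],
    'longitude': [-2.0, 5.0, 12.0, 4.0],
    'temperature': [8.0, 10.0, 9.5, 10.0],
})

# Filter rows, keep two columns, then sort; one step per line
result = (
    cities_demo[(cities_demo.latitude > 50) & (cities_demo.temperature > 9)]
    [['city', 'longitude']]
    .sort_values(by='longitude')
)
```

Only cities R and S pass both conditions, and S comes first because its longitude is smaller.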

# <font color="red">Exercise</font>

Use the Titanic data to print the name, age and gender of all surviving passengers less than 20 years old in class C, sorted by their age (smallest to largest).

## 4. Descriptive Statistics  From Data

Descriptive statistics can give you great insight into the shape of each attribute. The **describe()** function on the Pandas DataFrame lists 8 statistical properties of each attribute:

* Count
* Mean
* Standard Deviation
* Minimum Value
* 25th Percentile
* 50th Percentile (Median)
* 75th Percentile
* Maximum Value

For example, to obtain the summary statistics for the Cities data:

In [None]:
cities.describe()

To obtain descriptive statistics of a particular column use:

In [None]:
cities['temperature'].mean()

However, we often want to calculate summary statistics grouped by subsets or attributes within fields of our data. For example, we might want to calculate the average temperature of all cities in each country. To accomplish this we can use the **.groupby** method.

In [None]:
cities.groupby('country').describe()

Let us find the average temperature of cities in each country.

In [None]:
cities.groupby('country')[['temperature']].mean()

In [None]:
# Average temperature of cities in France.
french = cities[cities.country == 'France']
french['temperature'].mean()
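
The two routes above, grouping and filtering, should agree. The following sketch checks this on a small made-up frame and also uses `agg` to request several statistics per group at once:

```python
import pandas as pd

# Hypothetical data, not taken from Cities.csv
df = pd.DataFrame({'country': ['France', 'France', 'Italy', 'Italy', 'Italy'],
                   'temperature': [10.0, 14.0, 11.0, 13.0, 15.0]})

# Split rows by country, then compute several statistics per group
summary = df.groupby('country')['temperature'].agg(['mean', 'min', 'max', 'count'])

# The filter-then-aggregate route gives the same mean for one country
france_mean = df[df.country == 'France']['temperature'].mean()
```

`summary` has one row per country and one column per statistic, so `summary.loc['France', 'mean']` matches `france_mean`.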

# <font color="red">Exercise</font> 

In [None]:
# Read the Players  data into dataframes
# Find the average highest point of countries in the EU and not in the EU,
# then the average highest point of countries with and without coastline

#Note: When there's more than one "query" you only see the last result.
# Try using print
# Hint: You can use groupby!

### Writing Out Data to CSV

We can use the **to_csv** method to export a DataFrame in CSV format. We can save it to a different folder by adding the folder name and a slash before the file name: **.to_csv('foldername/filename.csv')**.


In [None]:
# Save the cities DataFrame to disk
cities.to_csv('Data/cities_modified.csv')
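
By default `to_csv` also writes the row index as the first column. The sketch below, which uses a made-up frame and in-memory buffers in place of files, compares the default with `index=False`:

```python
import io
import pandas as pd

# Hypothetical frame; in-memory buffers stand in for files on disk
df = pd.DataFrame({'city': ['P', 'Q'], 'temperature': [7.5, 1.75]})

with_index = io.StringIO()
df.to_csv(with_index)                    # default: the row index is written too
with_index.seek(0)
reread_with_index = pd.read_csv(with_index)

without_index = io.StringIO()
df.to_csv(without_index, index=False)    # skip the row index
without_index.seek(0)
reread_clean = pd.read_csv(without_index)
```

Reading the first buffer back produces an extra column named `Unnamed: 0`, which is usually not wanted. Passing `index=False` avoids it.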

# 6 Plotting and Visualization

There are a handful of third-party Python packages that are suitable for creating scientific plots and visualizations. These include packages like:

1. [Matplotlib](http://matplotlib.org/)
2. [Seaborn](http://seaborn.pydata.org/)
3. [Bokeh](http://bokeh.pydata.org/en/latest/)

However, pandas has a **.plot** namespace, with various chart types available **(line, hist, scatter, etc.)**.

We will focus exclusively on Matplotlib and the high-level plotting available within pandas.

## 6.1 Pandas' built-in plotting

Plot Formatting

In [None]:
from matplotlib import rcParams
from math import sqrt
%matplotlib inline

fig_width = 6.9                       # width in inches
golden_mean = (sqrt(5)-1.0)/2.0       # Aesthetic ratio
fig_height = fig_width*golden_mean    # height in inches

params = {
    # The preamble is given as a string; recent Matplotlib releases reject a list
    'text.latex.preamble': r'\usepackage{gensymb}',
    'font.size': 10,
    'axes.labelsize': 10,   # fontsize for x and y labels
    'axes.titlesize': 12,
    'legend.fontsize': 8,
    'xtick.labelsize': 10,
    'ytick.labelsize': 10,
    'text.usetex': True,    # requires a working LaTeX installation
    'figure.figsize': [fig_width, fig_height],
    'font.family': 'serif'
}
rcParams.update(params)

In [None]:
# Load citiesext dataset
citiesext = pd.read_csv('Data/citiesext.csv')

#### Line plot

In [None]:
citiesext.latitude.plot()

#### Scatter plot of latitude against temperature

In [None]:
citiesext.plot.scatter(x='temperature', y='latitude', c='#E88C0C', s=55)

#### Histograms

In [None]:
citiesext.temperature.hist(grid=False )

There are algorithms for determining an "optimal" number of bins, each of which varies somehow with the number of observations in the data series.

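When a kernel density estimate is drawn over a histogram, the histogram has to be normalized (normed=True in older Matplotlib releases, density=True in current ones), since the kernel density is normalized by definition (it is a probability distribution).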
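
Both points can be illustrated with NumPy alone, without drawing anything. The sketch below assumes a randomly generated sample in place of the temperature column. It compares two of NumPy's bin-selection rules and checks that a normalized histogram has a total bar area of 1:

```python
import numpy as np

# Hypothetical sample standing in for a temperature column
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=4.0, size=200)

# Two of NumPy's built-in bin-selection rules; both depend on the sample size
edges_sturges = np.histogram_bin_edges(sample, bins='sturges')
edges_fd = np.histogram_bin_edges(sample, bins='fd')

# density=True rescales the bar heights so that the total bar area equals 1
heights, edges = np.histogram(sample, bins=edges_sturges, density=True)
total_area = np.sum(heights * np.diff(edges))
```

The two rules will generally propose different numbers of bins for the same data, which is why no single "optimal" choice exists.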

### Boxplots
A different way of visualizing the distribution of data is the boxplot, which is a display of common quantiles; these are typically the quartiles and the lower and upper 5 percent values.

In [None]:
citiesext.boxplot(column='temperature', by='EU', grid=False)

You can think of the box plot as viewing the distribution from above. The white circles are "outlier" points that occur outside the extreme quantiles.
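
The numbers behind a box plot can be computed by hand. This sketch uses made-up values and assumes the common whisker rule of 1.5 times the interquartile range (which, to my knowledge, is Matplotlib's default and differs from the 5 percent convention mentioned above):

```python
import pandas as pd

# Hypothetical values; 30.0 is placed far from the rest on purpose
temps = pd.Series([4.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 30.0])

q1, median, q3 = temps.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Assumed whisker rule: points beyond 1.5 * IQR from the box count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = temps[(temps < lower_fence) | (temps > upper_fence)]
```

Under this rule only the value 30.0 falls outside the fences, so it would be drawn as a separate point.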