# Introduction to Pandas

In [None]:
import pandas as pd

In [None]:
data = pd.read_csv("data/gapminder_gdp_europe.csv", index_col='country')

In [None]:
type(data)

In [None]:
data

In [None]:
# Use DataFrame.iloc[..., ...] to select values by their (entry) position

print(data.iloc[0, 0])


In [None]:
# Use DataFrame.loc[..., ...] to select values by their (entry) label.

print(data.loc["Albania", "gdpPercap_1952"])


In [None]:
# Use : on its own to mean all columns or all rows.

print(data.loc["Albania", :])


In [None]:
print(data.loc[:, "gdpPercap_1952"])


In [None]:
# Select multiple columns or rows using DataFrame.loc and a named slice.
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'])


In [None]:
# Result of slicing can be used in further operations.
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].max())


In [None]:
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].min())


In [None]:
# Use comparisons to select data based on value.


In [None]:
# Use a subset of data to keep output readable.
subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
print('Subset of data:\n', subset)

# Which values were greater than 10000?
print('\nWhere are values large?\n', subset > 10000)

In [None]:
# Select values or NaN using a Boolean mask.
# A frame full of Booleans is sometimes called a mask because of how it can be used.

mask = subset > 10000
print(subset[mask])

In [None]:
# Get the value where the mask is True, and NaN (Not a Number) where it is False.
# Useful because NaNs are ignored by operations like max, min, mean, etc.
print(subset[subset > 10000].describe())
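
To see why the NaN values matter, here is a minimal sketch on a small made-up frame. The frame `toy`, its labels, and its numbers are hypothetical and are not taken from the Gapminder file. It assumes the default behaviour of pandas reductions, which skip NaN entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Gapminder file.
toy = pd.DataFrame(
    {"y1962": [8000.0, 12000.0, 15000.0], "y1967": [9500.0, 13000.0, 9000.0]},
    index=["A", "B", "C"],
)

# Values that fail the comparison are replaced by NaN.
masked = toy[toy > 10000]

# Reductions skip NaN by default, so each statistic only uses
# the values that passed the mask.
col_max = masked.max()
col_mean = masked.mean()
col_count = masked.count()  # number of non-NaN values per column

print(masked)
print(col_mean)
print(col_count)
```

Here `count` reports how many values in each column survived the mask, which is the same `count` row that `describe()` prints.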

## Group By: split-apply-combine
Pandas' vectorized methods and grouping operations give users a lot of flexibility when analysing their data.

For instance, suppose we want a clearer view of how the European countries divide according to their GDP per capita.

1. We can get a first impression by splitting the countries into two groups for each surveyed year: those with a GDP per capita above the European average and those with a GDP per capita below it.
2. We then estimate a wealth score from the historical values (1952 to 2007), based on how many times each country fell into the higher-GDP group.

In [None]:
mask_higher = data > data.mean()
wealth_score = mask_higher.aggregate('sum', axis=1) / len(data.columns)
wealth_score


In [None]:
data.groupby(wealth_score).sum()
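
The same split-apply-combine steps can be traced by hand on a tiny frame. This is only a sketch: the frame `toy`, its labels, and its values are hypothetical, chosen so the result is easy to check.

```python
import pandas as pd

# Hypothetical toy data: three "countries" (rows), two "years" (columns).
toy = pd.DataFrame(
    {"y1": [1.0, 5.0, 9.0], "y2": [2.0, 6.0, 10.0]},
    index=["A", "B", "C"],
)

# Split: True where a value is strictly above its column mean.
above = toy > toy.mean()

# Fraction of years each row spent above the mean (True counts as 1).
score = above.sum(axis=1) / len(toy.columns)

# Apply and combine: rows that share a score are summed together.
grouped = toy.groupby(score).sum()

print(score)
print(grouped)
```

Rows A and B never exceed the column means, so they share a score of 0.0 and are summed into one group, while C forms its own group with a score of 1.0.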


### EXERCISE 1: Selection of Individual Values
Write an expression to find the GDP per capita of Serbia in 2007.



### EXERCISE 2: Extent of Slicing
Do the two statements below produce the same output?
Based on this, what rule governs what is included (or not) in numerical slices and named slices in Pandas?

In [None]:
print(data.iloc[0:2, 0:2])
print(data.loc['Albania':'Belgium', 'gdpPercap_1952':'gdpPercap_1962'])

### Reconstructing Data
Explain what each line in the following short program does: what is in first, second, etc.?

In [None]:
first = pd.read_csv('data/gapminder_all.csv', index_col='country')
second = first[first['continent'] == 'Americas']
third = second.drop('Puerto Rico')
fourth = third.drop('continent', axis = 1)
fourth.to_csv('result.csv')

### Selecting Indices
Explain in simple terms what idxmin and idxmax do in the short program below. When would you use these methods?



In [None]:
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
print(data.idxmin())
print(data.idxmax())


### Practice with Selection
Assume Pandas has been imported and the Gapminder GDP data for Europe has been loaded. Write an expression to select each of the following:

1. GDP per capita for all countries in 1982.
2. GDP per capita for Denmark for all years.
3. GDP per capita for all countries for years after 1985.
4. GDP per capita for each country in 2007 as a multiple of GDP per capita for that country in 1952.

### Many Ways of Access
There are at least two ways of accessing a value or slice of a DataFrame: by name or index. However, there are many others. For example, a single column or row can be accessed either as a DataFrame or a Series object.
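
As a small illustration of the Series versus DataFrame distinction, the sketch below uses a made-up frame (`toy` and its labels are hypothetical). It shows only one pair of spellings for a column and one for a row, and other approaches exist.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Gapminder file.
toy = pd.DataFrame({"y1": [1.0, 5.0], "y2": [2.0, 6.0]}, index=["A", "B"])

col_series = toy["y1"]      # single label: returns a Series
col_frame = toy[["y1"]]     # list of labels: returns a one-column DataFrame
row_series = toy.loc["A"]   # single label: returns a Series indexed by column names
row_frame = toy.loc[["A"]]  # list of labels: returns a one-row DataFrame

print(type(col_series).__name__, type(col_frame).__name__)
print(type(row_series).__name__, type(row_frame).__name__)
```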

Suggest different ways of doing the following operations on a DataFrame:

1. Access a single column
2. Access a single row
3. Access an individual DataFrame element
4. Access several columns
5. Access several rows
6. Access a subset of specific rows and columns
7. Access a subset of row and column ranges