# Pandas

## Installation

In [4]:
#!pip install pandas --upgrade --user


## Import convention

In [1]:
import pandas as pd
import numpy as np

# What is Pandas?

Pandas can be thought of as an enhanced version of numpy arrays in which rows and columns are identified by labels rather than only by integer positions.
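A minimal sketch of that difference. The labels and values below are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# A plain numpy array: elements are reachable only by integer position
heights_array = np.array([177, 175, 168])
first_by_position = heights_array[0]

# The same values in a pandas Series, with illustrative labels as the index
heights = pd.Series([177, 175, 168], index=['joao', 'andre', 'maria'])
first_by_label = heights['joao']

print(first_by_position, first_by_label)
```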

There are **three** main pandas elements we **need** to understand.
1. Pandas Series
2. Pandas DataFrame
3. Index

# The Pandas Series

A pandas series is a one-dimensional (**1-D**) indexed array.

In [None]:
pd

In [None]:
pd.Series()

## Creating a pandas Series from a list

You'll start to recognize a pandas Series by the way it is displayed.

In [4]:
a = pd.Series([5,8,3,3]).nbytes
b = pd.Series(["Assim",8,3]).nbytes
c = pd.Series([5.0,8,3]).nbytes

print(a,b,c)

32 24 24


`dtype` means the `data type` of what is inside your pandas Series.

In [None]:
# As in lists, you don't need to have all of the same type inside a pandas series

pd.Series(['a', 2, 3])

When you see `dtype: object`, it usually means your `Series` holds `str` values or a mix of different types.
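As a rough sketch of how pandas picks a `dtype`, the three Series below (values chosen only for illustration) show an all-integer list, a list where a single float upcasts every value, and a mixed list that falls back to `object`:

```python
import pandas as pd

ints = pd.Series([5, 8, 3, 3])     # all integers
floats = pd.Series([5.0, 8, 3])    # one float upcasts every value to float
mixed = pd.Series(['a', 2, 3])     # mixed types fall back to object

print(ints.dtype, floats.dtype, mixed.dtype)

# nbytes is the number of elements times the size of one element
ints_bytes_match = ints.nbytes == len(ints) * ints.dtype.itemsize
```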

In [None]:
data = pd.Series([10,23,3,43,25,136])

In [None]:
data

In [None]:
type(data)

So the `type` of `data` is `pandas.core.series.Series`, and the type of the values inside the Series (its `dtype`) is `int64`.

## Accessing elements 

Elements can be accessed the same way as in a numpy array.

In [None]:
data

In [None]:
data[0]

In [None]:
data[4:]

In short: a pandas Series can be thought of as a 1-D numpy array.

### What is the difference then? Numpy array vs Pandas Series

Mostly the index notation.

Numpy arrays only have the **implicit** index given by each element's position. By using an **explicit** index, a pandas Series is much more flexible. For example:

## Index labels don't need to be numbers

In [None]:
my_series = pd.Series(data=[1,2,3,5,7,9], index=['Andre','Joao','Andre','Vampirão','Tieko','Satiko']) #index argument
my_series

## Index labels don't need to be in sequence

In [None]:
data = pd.Series(data=[1,2,3,4], 
                 index=[1,7,4313,19])

In [None]:
data
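With integer labels that are not in sequence, it matters whether a number is read as a label or as a position. A small sketch using the same Series, with `.loc` for labels and `.iloc` for positions:

```python
import pandas as pd

data = pd.Series(data=[1, 2, 3, 4], index=[1, 7, 4313, 19])

by_label = data.loc[1]       # the element whose label is 1 (the first one here)
by_position = data.iloc[1]   # the second element, whatever its label

print(by_label, by_position)
```

Being explicit with `.loc` and `.iloc` avoids having to remember how plain `data[1]` is interpreted.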

### Then how can I access these pandas series?

In [None]:
my_series

In [None]:
my_series['Andre']

In [None]:
my_series['Andre'].mean()

**NOTE:** One can think of a pandas Series, then, as a form of dictionary, in which the index labels are the keys and the rows are the values.
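A short sketch of the dictionary analogy, reusing the Series defined earlier. One difference from a real `dict` is that labels may repeat, in which case a lookup returns every match:

```python
import pandas as pd

my_series = pd.Series(data=[1, 2, 3, 5, 7, 9],
                      index=['Andre', 'Joao', 'Andre', 'Vampirão', 'Tieko', 'Satiko'])

has_joao = 'Joao' in my_series      # membership is tested against the index, like dict keys
joao_value = my_series['Joao']      # unique label: a single value
andre_values = my_series['Andre']   # repeated label: a Series with every match

print(has_joao, joao_value, list(andre_values))
```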

In [None]:
my_series

In [None]:
my_series.keys()

In [None]:
my_series.values

In [None]:
my_series * 3

## Creating a pandas series from a dict.

In [None]:
my_dict = {'JOAO': 20, 
           'ANDRE':10}

In [None]:
my_dict

In [None]:
pd.Series(my_dict)

But what about data with more than one dimension?


# Pandas DataFrame


Pandas DataFrames can be thought of as a generalization of **2-D** numpy arrays. Again, they bring flexibility to both the row index and the column names.

In [None]:
pd.DataFrame()

In [None]:
type(pd.DataFrame())

## A pandas DataFrame can be thought of as a group of pandas Series

In [None]:
my_dict = {'JOAO': 25, 
           'ANDRE':28}

data = pd.Series(my_dict)

In [None]:
data

In [None]:
another_dict = {'JOAO': 177,'ANDRE': 175}

data_2 = pd.Series(another_dict)

In [None]:
data_2

In [None]:
my_series

In [None]:
my_series2 = pd.Series(['Professor','TA','Aluno','Aluno','Aluna','Aluna'], index=['Andre','Matheus','Ale','Thiago','Leticia','Gabriela'])
my_series2

In [None]:
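# my_series contains the label 'Andre' twice, and pandas cannot align Series that have
# duplicate labels, so keep only the first occurrence of each label
my_dict = {'nota': my_series[~my_series.index.duplicated()], 'cargo': my_series2}
my_dict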

In [None]:
pd.DataFrame(my_dict).describe()

# Creating a DataFrame from a collection of Series

In [None]:
{'idade':data, 'altura':data_2}

In [None]:
my_dataframe = pd.DataFrame({'idade':data, 'altura':data_2})
my_dataframe 

**NOTE:** So a DataFrame can be thought of as a dictionary, in which the `keys` are the `column names` and the `values` are the `pandas Series` themselves.
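The dictionary analogy can be checked directly. This sketch rebuilds the same small table and shows that the keys are the column names, not the row labels:

```python
import pandas as pd

data = pd.Series({'JOAO': 25, 'ANDRE': 28})
data_2 = pd.Series({'JOAO': 177, 'ANDRE': 175})
my_dataframe = pd.DataFrame({'idade': data, 'altura': data_2})

column_names = list(my_dataframe.keys())   # keys() returns the column labels
has_idade = 'idade' in my_dataframe        # membership is tested against the columns
has_joao = 'JOAO' in my_dataframe          # row labels are not keys

print(column_names, has_idade, has_joao)
```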

In [None]:
my_dataframe['altura']

In [None]:
my_dataframe.altura
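Attribute access is a convenient shorthand, but it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute. A sketch with a hypothetical table (the `peso kg` and `mean` columns are invented for illustration):

```python
import pandas as pd

# Hypothetical table: one column name contains a space, another matches a DataFrame method
df = pd.DataFrame({'altura': [177, 175], 'peso kg': [70, 68], 'mean': [1, 2]})

same = df['altura'].equals(df.altura)   # both forms return the same column here
peso = df['peso kg']                    # attribute access is impossible for this name
is_method = callable(df.mean)           # df.mean is the method, not the column
mean_column = df['mean']                # brackets always reach the column

print(same, is_method, list(mean_column))
```

Bracket access works for every column name, which is why it is the safer default.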

# Access methods: accessing DataFrame rows and columns

`.loc` and `.iloc` are the recommended ways to access data in a DataFrame. You can specify both the row and the column, or only the row.

In [None]:
my_dataframe

## `dataframe.loc[row_name, col_name]`

In [None]:
my_dataframe.loc['JOAO', 'idade']

In [None]:
my_dataframe.loc[:, 'idade']

In [None]:
my_dataframe.loc['ANDRE', 'altura']
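Passing only a row label to `.loc` returns the whole row. A small sketch, rebuilding the same table so the block runs on its own:

```python
import pandas as pd

my_dataframe = pd.DataFrame({'idade': pd.Series({'JOAO': 25, 'ANDRE': 28}),
                             'altura': pd.Series({'JOAO': 177, 'ANDRE': 175})})

row = my_dataframe.loc['JOAO']   # only the row label: returns that row as a Series
print(row)
```

The resulting Series is indexed by the column names.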

## `dataframe.iloc[row_number, col_number]`

In [None]:
my_dataframe

In [None]:
my_dataframe.iloc[0, 0]

In [None]:
my_dataframe.iloc[1, 1]

In [None]:
my_dataframe.iloc[-1, 1]

What is the difference between selecting a column via `my_dataframe['idade']` and via `my_dataframe.loc[:, 'idade']`?
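One way to explore this question is to compare the two results directly. In this sketch both forms return the same Series; the practical difference is that `.loc` can also restrict the rows in the same step:

```python
import pandas as pd

my_dataframe = pd.DataFrame({'idade': pd.Series({'JOAO': 25, 'ANDRE': 28}),
                             'altura': pd.Series({'JOAO': 177, 'ANDRE': 175})})

by_brackets = my_dataframe['idade']
by_loc = my_dataframe.loc[:, 'idade']
same = by_brackets.equals(by_loc)                 # identical values and labels

joao_only = my_dataframe.loc[['JOAO'], 'idade']   # rows and column chosen together

print(same, list(joao_only))
```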

# Creating dataframes

## From a list in 1-D

In [None]:
my_list = [1,2,3]

In [None]:
np.array(my_list)

In [None]:
np.array(my_list).shape

In [None]:
pd.DataFrame(data=my_list)

In [None]:
pd.DataFrame(data=my_list, columns=['notas'], index=['Andre','Maria','Joao'])

## From a list in > 1-D (let's remember numpy arrays here!)

In [None]:
my_list = [[1,2,3],
           [-5,-6,-7]]

In [None]:
np.array(my_list)

In [None]:
np.array(my_list).shape

In [None]:
df = pd.DataFrame(data=my_list, columns=['idade','peso','altura'])
df

In [None]:
df.shape

## From a dictionary of lists

In [None]:
pd.DataFrame({'ironhack_students': ['a','b','c'],
              'NOTA':[10, 10, 0]})

## From a numpy array

In [None]:
a = np.random.random(size=(5, 4))
a

In [None]:
data = pd.DataFrame(a, columns=['altura', 'peso', 'idade', 'largura'])

In [None]:
data

### Accessing rows and columns:

#### `.loc`

Remember: `.loc` receives `[row_name, column_name]`.

In [None]:
# get the third row (label 2) of column `peso`; the data is random, so the value changes on every run

data.loc[2, 'peso']

In [None]:
# get entire third row

data.loc[2, :]

In [None]:
# get entire `idade` column

# data['idade']
# data.idade
data.loc[:, 'idade']

In [None]:
data

In [None]:
# get all rows for columns `peso` through `largura` (label slices include both endpoints)

data.loc[:, 'peso':'largura']

In [None]:
data.loc[:, ['peso','idade','largura']]

In [None]:
data.loc[:3, ['peso','idade','largura']]

In [None]:
data.loc[:-1, ['peso','idade','largura']]
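The last two cells are worth comparing with their positional equivalents. In this sketch, assuming the default integer row labels 0 to 4, `.loc` treats the slice bounds as labels and includes the end label, while `.iloc` treats them as positions and excludes the end:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.random(size=(5, 4)),
                    columns=['altura', 'peso', 'idade', 'largura'])

loc_rows = data.loc[:3, ['peso']]        # labels 0, 1, 2 and 3: the end label is included
iloc_rows = data.iloc[:3, [1]]           # positions 0, 1 and 2: the end position is excluded

loc_negative = data.loc[:-1, ['peso']]   # -1 is read as a label, not as "the last row"
iloc_negative = data.iloc[:-1, [1]]      # every row except the last

print(len(loc_rows), len(iloc_rows), len(loc_negative), len(iloc_negative))
```

Because no row has a label at or below -1, the `.loc[:-1, ...]` selection comes back empty.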

In [None]:
data.loc[:, 'idade']

In [None]:
data.loc[:, 'idade'].shape

In [None]:
data.loc[:, ['idade']]

In [None]:
data.loc[:, ['idade']].shape
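The two shapes differ because a single label returns a 1-D Series, while a list of labels returns a 2-D DataFrame, even when the list has only one entry. A minimal sketch:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.random(size=(5, 4)),
                    columns=['altura', 'peso', 'idade', 'largura'])

as_series = data.loc[:, 'idade']     # single label: a Series
as_frame = data.loc[:, ['idade']]    # list of labels: a DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```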

#### `.iloc`

In [None]:
data.iloc[0:4, :]

In [None]:
data.iloc[0:4, 1:4]

In [None]:
data.iloc[-1, :]

## Math operations

In [None]:
data = np.random.random(size=(8, 4))

In [None]:
df = pd.DataFrame(data, columns=['Andre','Maria','Joao','Vamp'])

In [None]:
df.transpose()

In [None]:
df.mean()

In [None]:
df.mean(axis=1)
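The `axis` argument decides the direction of the calculation. A sketch with a small hypothetical table, where the numbers are easy to check by hand:

```python
import pandas as pd

# Hypothetical scores, chosen so the means are easy to verify
scores = pd.DataFrame({'Andre': [1.0, 3.0], 'Maria': [2.0, 6.0]})

col_means = scores.mean()         # axis=0 (default): one value per column
row_means = scores.mean(axis=1)   # axis=1: one value per row

print(col_means)
print(row_means)
```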

In [None]:
df.std()

In [None]:
df.describe()

In [None]:
df['Total'] = df['Andre'] + df['Vamp']
df

In [None]:
df.mean()

In [None]:
df['Andre']

# Pandas Index

In [None]:
pd.Index([1,2,3])

In [None]:
df

In [None]:
df.index

In [None]:
df.index.values
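An `Index` behaves much like an array of labels, with two properties worth knowing: it cannot be modified in place, and it supports set-like operations. A brief sketch with illustrative values:

```python
import pandas as pd

idx = pd.Index([1, 2, 3])

# Item assignment is not allowed on an Index
try:
    idx[0] = 10
    mutated = True
except TypeError:
    mutated = False

# Set-like operation between two Index objects
common = idx.intersection(pd.Index([2, 3, 4]))

print(mutated, list(common))
```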