## Pandas Data Frames

- What Is Pandas?
- Pandas vs NumPy
- Pandas Data Frame Intro
- Pandas Data Frame fundamental operations
    - Creating
    - Selecting/indexing
    - Inserting rows/columns
    - Setting data
    - Filtering
    - Dropping rows/columns
- Dealing with Missing values

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
from IPython.display import Image
from IPython.display import HTML
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))


In [None]:
from IPython.display import display,Image, HTML

CSS = """
.output {
    align-items: center;
}
div.output_area {
    width: 80%;
}
"""
HTML('<style>{}</style>'.format(CSS))

# What is Pandas?


### - Pandas is an open-source library built on top of NumPy
- https://github.com/pandas-dev/pandas
- https://github.com/pandas-dev/pandas/blob/059c8bac51e47d6eaaa3e36d6a293a22312925e6/pandas/core/frame.py

### - Enables working with tabular and labeled data easily and intuitively

### - Pandas data structures:
    - Series
    - Index
    - Data Frame
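
A minimal sketch of how the three structures relate, using made-up values; the names `ages`, `scores`, `labels`, and `frame` are illustrative only:

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
ages = pd.Series([41, 28, 33], index=["a", "b", "c"], name="age")
scores = pd.Series([88.0, 79.0, 81.0], index=["a", "b", "c"], name="score")

# An Index holds the labels themselves.
labels = ages.index

# A DataFrame is a table whose columns are Series that share one Index.
frame = pd.DataFrame({"age": ages, "score": scores})

print(type(frame["age"]))  # selecting one column gives back a Series
print(frame.index)
```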
    

## Quick intro to NumPy Arrays
- contain numerical ***homogeneous*** data: every element shares one dtype
- can have one or more dimensions
- are used for numerical computation on single- and multi-dimensional data

In [None]:
import numpy as np
np.random.seed(0)  # seed for reproducibility

two_dim_arr = np.random.randint(10, size=(3, 4))  # Two-dimensional array
three_dim_arr = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array


### A two dimensional Array example...

In [None]:
print("Two-Dimensional Array")
two_dim_arr

### What I mean by Homogeneous...

In [None]:
print(two_dim_arr)

**two_dim_arr[0,0] = "Hello"**

In [None]:
two_dim_arr[0,0] = "Hello"  # raises ValueError: an integer array cannot hold a string

In [None]:
type(two_dim_arr)
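
To see the failure without stopping the notebook, the assignment can be wrapped in `try`/`except`. This is a small sketch on a hypothetical array `arr`, not the `two_dim_arr` used above:

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])
print(arr.dtype)  # a single integer dtype shared by every element

error_message = None
try:
    arr[0, 0] = "Hello"  # NumPy tries to convert the string to an integer
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)

print(arr)  # the array is unchanged
```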

### You can directly form the DataFrame from the 2D array

In [None]:
import pandas as pd

In [None]:

print("Data Frame formed by 2D Array")

df=pd.DataFrame(two_dim_arr)
df

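### Pandas Data Frame is Heterogeneous!
Each column has its own dtype, so one table can mix numbers and text. Recent pandas versions warn about, or reject, writing a string into an integer column, so column 0 is cast to `object` first.

**df[0] = df[0].astype(object)**

**df.iloc[0,0]="Hello"**

In [None]:
df[0] = df[0].astype(object)  # let column 0 hold arbitrary Python objects
df.iloc[0,0] = "Hello"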

In [None]:
df
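
Heterogeneity is easiest to see per column. The sketch below builds a hypothetical frame `mixed` with text, integer, and float columns and prints the dtype pandas infers for each:

```python
import pandas as pd

mixed = pd.DataFrame({
    "name": ["Maria", "Dylan"],   # text
    "age": [29, 34],              # integers
    "height_m": [1.68, 1.81],     # floats
})

# One dtype per column, not one dtype for the whole table.
print(mixed.dtypes)
```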

### A Pandas Data Frame labels its data with index (row) and column labels
pd.DataFrame(np.random.randint(10,size=(3,2)),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

In [None]:
np.random.randint(10,size=(3,2))

In [None]:
##np.random.seed(0)
foo_df = pd.DataFrame(np.random.randint(10,size=(3,2)),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c']
                 )

In [None]:
foo_df

### Pandas DataFrame is well suited to statistical observations (data points) with variables of different types (categorical, numerical, etc.)

In [None]:
Image("res/Tidy_census.png")

### It's intuitive...  Look how convenient it is

In [None]:
people_df= pd.read_csv("data/people.csv")
people_df

In [None]:
Image('res/excel-to-pandas.png')

In [None]:
name = ['Maria', 'Dylan', 'Philipp', 'Konsti']
gender = ['female', 'male', 'male', 'male']
fav_food = ['sushi', 'tacos', 'fried_chicken', 'lasagna']

In [None]:
students_dict = {'name':name, 'gender':gender, 'fav_food':fav_food}

In [None]:
students_df = pd.DataFrame(students_dict)

In [None]:
students_df.iloc[0,0] = 1

In [None]:
students_df

source: https://jalammar.github.io/

### Describing the Data Frame...
- df.info()
- df.count()
- df.describe()
- df.mean(numeric_only=True)
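
As a rough illustration of what each call returns, here is a sketch on a small hypothetical frame named `scores`. `numeric_only=True` is passed to `mean` because the frame also contains a text column:

```python
import pandas as pd

scores = pd.DataFrame({
    "name": ["Xavier", "Ann", "Jana"],
    "age": [41, 28, 33],
    "py-score": [88.0, 79.0, 81.0],
})

scores.info()                            # prints column names, non-null counts, dtypes
counts = scores.count()                  # non-missing values per column
summary = scores.describe()              # count, mean, std, min, quartiles, max (numeric columns)
means = scores.mean(numeric_only=True)   # mean of the numeric columns only

print(counts)
print(summary)
print(means)
```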

In [None]:
students_df.info()

In [None]:
people_df

In [None]:
people_df['age'].describe()

In [None]:
people_df.count()

### Pandas Data Frame operations

In [None]:
Image("res/CRUD.png")

### Data Frame creation
You can create/form a Data Frame from:
- Dict of 1D ndarrays, lists, dicts, or Series

- 2-D numpy.ndarray

- Structured or record ndarray

- A Series

- Another DataFrame
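
The sources listed above can be sketched as follows. All variable names and values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Dict of lists: keys become column names.
from_dict = pd.DataFrame({"col1": [1.0, 2.0], "col2": [4.0, 3.0]})

# 2-D ndarray: column names are supplied separately.
from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["x", "y"])

# Structured ndarray: field names become column names.
records = np.array([(1, 2.5), (2, 3.5)], dtype=[("id", "i4"), ("value", "f8")])
from_records = pd.DataFrame(records)

# A named Series becomes a single-column DataFrame.
from_series = pd.DataFrame(pd.Series([10, 20], name="amount"))

# Another DataFrame.
from_frame = pd.DataFrame(from_dict)
```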

#### Here is an example...

In [None]:
print('example_dict = {"col1": [1.0, 2.0, 3.0, 4.0], "col2": [4.0, 3.0, 2.0, 1.0]}\n')

example_dict = {"col1": [1.0, 2.0, 3.0, 4.0], "col2": [4.0, 3.0, 2.0, 1.0]}

example_dict

In [None]:
df = pd.DataFrame(example_dict, index=['a', 'b', 'c', 'd'])  # explicit row labels

In [None]:
df

#### Without an explicit index, pandas creates a default one (0, 1, 2, ...)

In [None]:
print('dic = {"col1": [1.0, 2.0, 3.0, 4.0], "col2": [4.0, 3.0, 2.0, 1.0]}\n')

dic= {"col1": [1.0, 2.0, 3.0, 4.0], "col2": [4.0, 3.0, 2.0, 1.0]}

dic

In [None]:
df=pd.DataFrame(dic)
df

### Creating a DataFrame from Pandas Series objects

In [None]:
# Each Series becomes one column; the dict keys become the column names.
d = {
    "apples": pd.Series([3, 2, 0, 1]),
    "oranges": pd.Series([0, 3, 7, 2]),
}

pd.DataFrame(d)

In [None]:
Image("res/series-and-dataframe.width-1200.png")

source: https://storage.googleapis.com/lds-media/images/series-and-dataframe.width-1200.png

### Data Frame Selection / Indexing

In [None]:
data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
             'Manchester', 'Cairo', 'Osaka'],
    'age': [41, 28, 33, 34, 38, 31, 37],
    'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}

row_labels = [101, 102, 103, 104, 105, 106, 107]
students_df = pd.DataFrame(data=data, index=row_labels)
students_df

In [None]:
students_df.index

In [None]:
students_df.drop(105)  # returns a copy without row 105; students_df itself keeps all 7 rows

In [None]:
students_df.reset_index(drop=True)

source: https://realpython.com/

### Data Selection

In [None]:
students_df

In [None]:
students_df.iloc[4,1]

In [None]:
students_df.loc[[101],["age"]]

In [None]:
students_df.loc[106, 'city']

In [None]:
students_df

### Selecting by Label
- .loc[] attribute

In [None]:
students_df

In [None]:
#print("students_df.loc[:, 'city']")
#students_df.loc[:, 'city']

In [None]:
print("students_df.loc[102:106, ['name', 'city']]")
students_df.loc[102:106, ['name', 'city']]

In [None]:
# A list of column names returns a DataFrame with just those columns.
cities = students_df[["name", "city"]]
cities

In [None]:
students_df.loc[106]

In [None]:
print("df.city")
students_df.city

### Selecting by Position
- .iloc[]

In [None]:
students_df

In [None]:
students_df.iloc[0,1]

In [None]:
print("students_df.iloc[1:6, [0, 1]]")
students_df.iloc[1:6, [0, 1]]

In [None]:
students_df

In [None]:
list_ = list(range(10))

In [None]:
list_[6:9]  # positions 6, 7, 8: a Python slice excludes its end, and .iloc follows the same rule

In [None]:
students_df.iloc[4:6,2:4]

### Can you tell what the difference is between loc and iloc?
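
One way to answer the question is to run both on the same rows. The sketch below uses a hypothetical frame `demo` whose index labels (101 to 104) differ from the row positions (0 to 3):

```python
import pandas as pd

demo = pd.DataFrame(
    {"name": ["Xavier", "Ann", "Jana", "Yi"], "age": [41, 28, 33, 34]},
    index=[101, 102, 103, 104],
)

# .loc selects by LABEL; a label slice includes its end.
by_label = demo.loc[102:103, "name"]

# .iloc selects by integer POSITION; a position slice excludes its end.
by_position = demo.iloc[1:3, 0]

print(by_label)
print(by_position)
```

Both selections return the rows for Ann and Jana: `.loc` reaches them through the labels 102 and 103, `.iloc` through positions 1 and 2.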

### Setting/Updating data

#### Let us first update the Data frame index..

In [None]:
students_df.index = np.arange(len(students_df))  # relabel the rows 0, 1, ..., n-1
students_df

### Inserting/deleting rows

In [None]:
students_df

In [None]:
name = ['Juan', 'Monika', 'Juliette']
city = ['Berlin','Paris', 'Paris']
age = [30,20,25]
py_score = [70,85,90]

new_students = { 'name' : name, 'city':city, 'age':age, 'py-score':py_score}

In [None]:
new_students

In [None]:
new_df = pd.DataFrame(new_students)
new_df

In [None]:
df = pd.concat([students_df,new_df])

In [None]:
df

In [None]:
df = df.reset_index(drop=True)

In [None]:
df

In [None]:
df.reset_index(drop=True, inplace=True)  # in-place equivalent of the assignment above
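
As an alternative to calling `reset_index` afterwards, `pd.concat` accepts `ignore_index=True`. A small sketch with hypothetical frames `first` and `second`:

```python
import pandas as pd

first = pd.DataFrame({"name": ["Xavier", "Ann"], "age": [41, 28]})
second = pd.DataFrame({"name": ["Juan"], "age": [30]})

# Default: each frame keeps its own row labels, so labels can repeat.
stacked = pd.concat([first, second])

# ignore_index=True renumbers the rows of the result from 0.
renumbered = pd.concat([first, second], ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```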

### Inserting/Deleting columns

In [None]:
students_df

In [None]:
# One value per row: the list length must match the number of rows.
students_df['js-score'] = [71.0, 95.0, 88.0, 79.0, 91.0, 91.0, 80.0]
students_df

In [None]:
students_df['py-score-updated'] = students_df['py-score'] * 10 

In [None]:
students_df

In [None]:
students_df.drop('js-score', axis=1)  # returns a new DataFrame; students_df still has 'js-score'

### Inserting in a specific location

In [None]:
students_df.insert(loc=4, column='django-score',
          value=np.array([70, 74, 78, 56, 66, 78, 81.0]))
students_df

### Dropping specific column

In [None]:
# columns='django-score' is equivalent to drop('django-score', axis=1); axis=0 (the default) drops rows
students_df.drop(columns='django-score')
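
Note that `drop` returns a new DataFrame and leaves the original untouched unless the result is assigned back. A minimal sketch with a hypothetical frame `grades`:

```python
import pandas as pd

grades = pd.DataFrame({
    "name": ["Ann", "Yi"],
    "py-score": [79.0, 80.0],
    "js-score": [95.0, 79.0],
})

without_js = grades.drop(columns="js-score")   # drop a column by name
without_first_row = grades.drop(index=0)       # drop a row by index label

print(list(grades.columns))      # the original still has all three columns
print(list(without_js.columns))
```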

### Filtering/Boolean Indexing

In [None]:
students_df["py-score"] >= 75

In [None]:
very_good_students_filter=students_df[students_df["py-score"]>=75]

In [None]:
very_good_students_filter

### Creating powerful filters with the logical operators AND (`&`), OR (`|`), NOT (`~`), XOR (`^`)

In [None]:
# Both conditions must hold; each comparison needs its own parentheses.
students_df[(students_df['py-score'] >= 70) & (students_df['django-score'] >= 80)]
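
The four operators can be compared side by side. This sketch uses a hypothetical frame `marks`; Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, so the symbols are used instead:

```python
import pandas as pd

marks = pd.DataFrame(
    {"py-score": [88.0, 79.0, 61.0, 84.0], "js-score": [71.0, 95.0, 91.0, 80.0]},
    index=["Xavier", "Ann", "Amal", "Nori"],
)

good_py = marks["py-score"] >= 80
good_js = marks["js-score"] >= 80

both = marks[good_py & good_js]         # AND: good in both
either = marks[good_py | good_js]       # OR: good in at least one
not_good_py = marks[~good_py]           # NOT: below 80 in py-score
exactly_one = marks[good_py ^ good_js]  # XOR: good in exactly one
```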

## Using the value_counts function

In [None]:
people_df=pd.read_csv("data/people.csv")
people_df

In [None]:
people_df['country'].unique()

In [None]:
people_df['country'].value_counts()

In [None]:
people_df['country'].nunique()
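
Since the contents of `data/people.csv` are not shown here, the sketch below uses a made-up Series `countries` to illustrate what the three calls return:

```python
import pandas as pd

countries = pd.Series(["DE", "FR", "DE", "MX", "DE", "FR"], name="country")

distinct = countries.unique()          # distinct values, in order of first appearance
frequency = countries.value_counts()   # occurrences per value, most frequent first
how_many = countries.nunique()         # number of distinct values

print(distinct)
print(frequency)
print(how_many)
```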

### Saving our df to csv

In [None]:
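# Writes to the working directory, not data/people.csv; pass index=False to leave out the row index column.
people_df.to_csv('people.csv')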
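
By default `to_csv` writes the row index as an extra first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows the difference on a hypothetical frame `people`, writing to in-memory strings rather than to disk:

```python
import io
import pandas as pd

people = pd.DataFrame({"name": ["Maria", "Dylan"], "country": ["DE", "MX"]})

# With no path argument, to_csv returns the CSV text as a string.
with_index = people.to_csv()
without_index = people.to_csv(index=False)

reread_with = pd.read_csv(io.StringIO(with_index))
reread_without = pd.read_csv(io.StringIO(without_index))

print(list(reread_with.columns))
print(list(reread_without.columns))
```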