# INF4039 Deep Learning Systems / Giliojo mokymo sistemų taikymai
**LAB2**

# Introduction to Pandas

**pandas** is a Python package that provides fast, flexible, and expressive data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time series data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. [https://pandas.pydata.org/]

In [None]:
#!pip install pandas

In [None]:
import pandas as pd

## Series and DataFrames

The two primary components of pandas are the Series and the DataFrame. A Series is essentially a single labeled column, and a DataFrame is a two-dimensional table made up of a collection of Series that share the same index.

In [None]:
data_example = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}

print(data_example)
type(data_example)

In [None]:
purchases = pd.DataFrame(data_example)
print(purchases)

In [None]:
type(purchases)

In [None]:
type(purchases['apples'])

In [None]:
type(purchases[['apples']])

# Most important DataFrame operations

## Import your data
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [None]:
df = pd.read_csv("titanic.csv")

Example of other arguments:

In [None]:
#df = pd.read_csv("titanic.csv", index_col=2)
#df = pd.read_csv("titanic.csv", index_col="Name")
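
The effect of these arguments can be seen without the real file. The following is a minimal sketch that parses a hypothetical three-row sample (the rows in `csv_text` are illustrative, not the actual contents of titanic.csv) from an in-memory string; `usecols` is another `read_csv` argument that keeps only the listed columns.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for titanic.csv
csv_text = """Survived,Pclass,Name,Age
0,3,Mr. Owen Harris Braund,22
1,1,Mrs. John Bradley Cumings,38
1,3,Miss. Laina Heikkinen,26
"""

# index_col="Name" turns the Name column into the row labels
by_name = pd.read_csv(io.StringIO(csv_text), index_col="Name")

# usecols keeps only the listed columns while parsing
small = pd.read_csv(io.StringIO(csv_text), usecols=["Survived", "Age"])

print(by_name.index.tolist())
print(small.columns.tolist())
```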

## Viewing your data

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html  
  
.head() outputs the first five rows of your DataFrame by default. .head(10) would output the top ten rows.

In [None]:
df.head()

To see the last five rows use .tail(). tail() also accepts a number.

In [None]:
df.tail()

The **display** function provides a cleaner display than merely printing the data frame.  Specifying the maximum rows and columns allows you to achieve greater control over the display.

In [None]:
pd.options.display.max_rows = 8
pd.options.display.max_columns = 8
display(df)
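
Assigning to `pd.options.display` changes the setting for the rest of the session. If the limit is only needed for one cell, `pd.option_context` is an alternative; the sketch below uses a hypothetical 100-row frame named `demo` and shows that the previous value is restored after the `with` block.

```python
import pandas as pd

# Hypothetical frame with more rows than we want to print
demo = pd.DataFrame({"n": range(100)})

before = pd.get_option("display.max_rows")

# Inside the block at most 6 rows are shown; the setting is restored afterwards
with pd.option_context("display.max_rows", 6):
    inside = pd.get_option("display.max_rows")
    print(demo)

after = pd.get_option("display.max_rows")
print(before, inside, after)
```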

Try uncommenting and running the code below to display all rows and columns:

In [None]:
#pd.options.display.max_columns = None
#pd.options.display.max_rows = None
#display(df)

## Getting info about your data

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df['Survived'].value_counts()
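
`value_counts()` can also report proportions instead of raw counts. A small sketch on hypothetical labels (the `survived` Series below is made up for illustration):

```python
import pandas as pd

# Hypothetical 0/1 labels standing in for the Survived column
survived = pd.Series([0, 1, 1, 0, 0, 1, 0, 0])

counts = survived.value_counts()                # absolute frequency per class
shares = survived.value_counts(normalize=True)  # relative frequency per class

print(counts)
print(shares)
```

Checking the class shares this way is a quick test for class imbalance before training.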

## Rename columns
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html

In [None]:
df.columns

In [None]:
df.rename(columns={
        'Pclass': 'P_class', 
    }, inplace=True)

In [None]:
df.head()

## Descriptive statistics
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

In [None]:
df.describe()

In [None]:
df['Age'].describe()

## Correlation
Correlation coefficients quantify the association between variables or features of a dataset.

Use .corr() to calculate any of three correlation coefficients. The **method** parameter selects the statistic: **pearson** (the default), **spearman**, or **kendall**.  

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html

In [None]:
x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# Pearson's r
print("Pearson's r")
print(x.corr(y))                     
print(y.corr(x))

# Spearman's rho
print("Spearman's rho")
print(x.corr(y, method='spearman'))  

# Kendall's tau
print("Kendall's tau")
print(x.corr(y, method='kendall'))   
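
To make the coefficients less of a black box, the sketch below recomputes two of them by hand on the same `x` and `y`. Pearson's r is the sum of products of the centered values divided by the product of their root sums of squares, and Spearman's rho is Pearson's r applied to the ranks. The names `r_manual` and `rho_manual` are introduced here for illustration only.

```python
import pandas as pd

x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# Pearson's r from its definition
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / ((dx ** 2).sum() ** 0.5 * (dy ** 2).sum() ** 0.5)

# Spearman's rho is Pearson's r computed on the ranks of the values
rho_manual = x.rank().corr(y.rank())

print(r_manual, x.corr(y))
print(rho_manual, x.corr(y, method='spearman'))
```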

In [None]:
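# numeric_only=True skips text columns such as Name, which recent pandas versions refuse to correlate
corr_matrix = df.corr(numeric_only=True)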
display(corr_matrix)

This example shows two ways of accessing values:

Use **.at[]** to access a single value by row and column labels.  
Use **.iat[]** to access a value by the positions of its row and column.  

In [None]:
corr_matrix.at['Survived', 'Age']

In [None]:
corr_matrix.iat[2, 4]

## How to work with missing values

Ideally, every row of data will have values for all columns. However, this is rarely the case. Missing values are a reality of machine learning.  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html

In [None]:
df.isnull()

**df.isnull().sum()** returns the number of missing values in each column (**isnull()** is an alias of **isna()**).

In [None]:
df.isnull().sum()
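
Because the boolean mask behaves like 0/1 values, taking its mean gives the share of missing values per column. A minimal sketch on a hypothetical four-row frame (`toy` is not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few gaps
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})

missing_count = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()  # fraction of missing values per column

print(missing_count)
print(missing_share)
```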

### Removing null values

Drop **rows** with null values:  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

In [None]:
df.dropna()

The **axis** parameter determines whether rows or columns that contain missing values are removed.  

**0**, or **index** : Drop rows which contain missing values.  

**1**, or **columns** : Drop columns which contain missing values.  

In [None]:
df.dropna(axis=1)
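
Note that `dropna()` returns a new DataFrame and leaves `df` unchanged unless the result is assigned. The sketch below, on a hypothetical frame, compares the `axis` choices with two further `dropna` parameters, `subset` and `thresh`, which give finer control over what is removed.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in Age and Fare
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, np.nan],
    "Sex": ["male", "female", "female", "male"],
})

rows_any = toy.dropna()                   # drop rows with any missing value
rows_subset = toy.dropna(subset=["Age"])  # only look at the Age column
rows_thresh = toy.dropna(thresh=2)        # keep rows with at least 2 non-missing values
cols_any = toy.dropna(axis=1)             # drop columns with any missing value

print(len(rows_any), len(rows_subset), len(rows_thresh))
print(cols_any.columns.tolist())
print(len(toy))  # the original frame is unchanged
```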

### Imputation
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

Replace missing values with a scalar:

In [None]:
df['Age'] = df['Age'].fillna(25)

A common practice is to replace missing values with the median value of that column:

In [None]:
median = df['Fare'].median()
df['Fare'] = df['Fare'].fillna(median)

Missing values can also be replaced with the values before or after them.

In [None]:
df.ffill()

ffill stands for “forward fill” and replaces each missing value with the last valid value above it in the same column. You can also use bfill, which stands for “backward fill”. Older pandas versions spelled this df.fillna(method="ffill"); that form is deprecated.
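
The difference between the two directions is easiest to see on a short Series. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with gaps in the middle and at the end
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()   # each gap takes the last valid value before it
backward = s.bfill()  # each gap takes the next valid value after it

print(forward.tolist())
print(backward.tolist())
```

The trailing gap stays missing after a backward fill because no later value exists to copy from; a leading gap behaves the same way under forward fill.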

## Standardization

In [None]:
# Strip non-numerics
df_2 = df.select_dtypes(include='number')  # keeps every integer and float column

In [None]:
display(df_2)

In [None]:
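# Standardize (z-score): zero mean and unit standard deviation per column
standartized_df = (df_2 - df_2.mean()) / df_2.std()

# Normalize (min-max): rescale each column to the range [0, 1]
normalized_df = (df_2 - df_2.min()) / (df_2.max() - df_2.min())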

In [None]:
display(standartized_df)

In [None]:
display(normalized_df)
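
A quick sanity check of both transformations is to look at the statistics of the result. The sketch below uses a hypothetical two-column frame; note that pandas `std()` computes the sample standard deviation (ddof=1) by default, so results can differ slightly from tools that divide by n.

```python
import pandas as pd

# Hypothetical numeric frame
toy = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [10.0, 20.0, 60.0]})

standardized = (toy - toy.mean()) / toy.std()
normalized = (toy - toy.min()) / (toy.max() - toy.min())

# After standardization every column has mean 0 and standard deviation 1
print(standardized.mean().round(6).tolist(), standardized.std().round(6).tolist())

# After min-max normalization every column spans exactly 0 to 1
print(normalized.min().tolist(), normalized.max().tolist())
```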


## DataFrame slicing, selecting, extracting

### By column

In [None]:
# For a DataFrame whose columns C, D, E are adjacent:
#df.loc[:, 'C':'E']
# is equivalent to
#df[['C', 'D', 'E']] or df.loc[:, ['C', 'D', 'E']]

In [None]:
subset_1 = df[['Fare','Survived']]
display(subset_1)

In [None]:
# Remember that positional slices exclude the ending index: 0:4 selects columns 0, 1, 2, 3.
subset_2 = df.iloc[:, 0:4]
display(subset_2)

### By row

**.loc** - locates by label (index name)  
**.iloc** - locates by integer position  

In [None]:
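# columns: the first 30 (or all of them, if the frame has fewer)
first_columns = df.iloc[:, 0:30]
# rows: the first 55; assigning to a new name keeps df itself intact
first_rows = df.iloc[0:55, :]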

In [None]:
# .loc expects a row label; with the default integer index the labels are 0, 1, 2, ...
first_row = df.loc[0]

In [None]:
age_1 = df.iloc[1]

In [None]:
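# Assumes the default integer index; with index_col="Name" the labels would be names instead
subset = df.loc[1:3]   # label slice: includes both end labels
subset = df.iloc[1:4]  # position slice: excludes the end position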
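
The distinction between labels and positions is clearer when the index holds names rather than integers. A minimal sketch on a hypothetical frame indexed by surname (the rows are illustrative):

```python
import pandas as pd

# Hypothetical frame with string row labels
toy = pd.DataFrame(
    {"Age": [22, 38, 26, 35], "Fare": [7.25, 71.28, 7.92, 53.10]},
    index=["Braund", "Cumings", "Heikkinen", "Futrelle"],
)

one_row = toy.loc["Heikkinen"]             # a single row, returned as a Series
by_label = toy.loc["Cumings":"Futrelle"]   # label slice includes both ends
by_position = toy.iloc[1:3]                # position slice excludes the end

print(one_row["Age"])
print(by_label.index.tolist())
print(by_position.index.tolist())
```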

## Concatenating Rows and Columns
pandas can concatenate rows and columns to form new data frames: **axis=1** places the inputs side by side as columns, and **axis=0** stacks them as rows.

In [None]:
age = df['Age']
fare = df['Fare']
result = pd.concat([age, fare], axis=1)
display(result)

In [None]:
# Create a new dataframe from first 2 rows and last 2 rows
result = pd.concat([df[0:2],df[-2:]], axis=0)
display(result)
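
When rows are stacked, each piece keeps its original index labels. If a fresh 0..n-1 index is preferred, `pd.concat` accepts `ignore_index=True`. A short sketch on a hypothetical five-row frame:

```python
import pandas as pd

# Hypothetical five-row frame
toy = pd.DataFrame({"Age": [22, 38, 26, 35, 54]})

# The original labels are carried over from both slices
ends = pd.concat([toy[0:2], toy[-2:]], axis=0)

# ignore_index=True discards them and numbers the rows from zero
renumbered = pd.concat([toy[0:2], toy[-2:]], axis=0, ignore_index=True)

print(ends.index.tolist())
print(renumbered.index.tolist())
```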

## Dropping Fields

Fields that are of no value to the neural network should be dropped. The following code removes the selected column from the dataset.  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html

In [None]:
df_example_for_drop = df.copy()
print(f"Before drop: {list(df_example_for_drop.columns)}")

# axis must be passed by keyword; the positional form drop('Name', 1) fails in pandas 2.x
df_example_for_drop.drop('Name', axis=1, inplace=True)

print(f"After drop: {list(df_example_for_drop.columns)}")

display(df_example_for_drop)

## Replace values

In [None]:
# Assign the result back: an inplace replace on a selected column may not update df itself
df['Sex'] = df['Sex'].replace({"female": 0, "male": 1})
display(df)

## Training and Validation

### From scratch

In [None]:
import numpy as np

# Shuffling the rows first is usually a good idea
df = df.reindex(np.random.permutation(df.index))

mask = np.random.rand(len(df)) < 0.8
trainDF = pd.DataFrame(df[mask])
validationDF = pd.DataFrame(df[~mask])

print(f"Training DF: {len(trainDF)}")
print(f"Validation DF: {len(validationDF)}")
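
The random mask above yields only approximately 80% of the rows for training, and the exact count changes on every run. As an alternative sketch (not the method used above), `DataFrame.sample` with a fixed `random_state` gives an exact and reproducible split; the frame `toy` is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical 100-row frame standing in for df
toy = pd.DataFrame({"x": range(100)})

# Draw exactly 80% of the rows at random; random_state makes the draw repeatable
train = toy.sample(frac=0.8, random_state=42)

# Everything that was not drawn becomes the validation set
validation = toy.drop(train.index)

print(f"Training DF: {len(train)}")
print(f"Validation DF: {len(validation)}")
```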

### Using libraries

#### Split into inputs and outputs

In [None]:
# Selecting by name avoids depending on where the Survived column sits in the file
X = df.drop(columns='Survived')  # independent columns
y = df['Survived']               # target column
print(X.shape, y.shape)

#### Split into train test sets
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

## Converting a Dataframe to a Matrix
Neural networks do not operate directly on pandas data frames; they require a numeric matrix. The **values** property of a data frame returns the data as a NumPy array. The equivalent **to_numpy()** method is the form recommended by the pandas documentation.

In [None]:
df.values

Convert some of the columns:

In [None]:
df[['Age', 'Survived']].values

In [None]:
X = X.values
y = y.values
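
The resulting array is only numeric if every selected column is numeric. The sketch below, on hypothetical frames, shows `to_numpy()` with an explicit `dtype` and what happens when a text column is left in.

```python
import pandas as pd

# Hypothetical all-numeric frame
toy = pd.DataFrame({"Age": [22, 38, 26], "Survived": [0, 1, 1]})

# Request 32-bit floats, a common input type for neural network libraries
matrix = toy.to_numpy(dtype="float32")

# A text column forces the whole array to the generic object dtype
mixed = pd.DataFrame({"Age": [22, 38], "Name": ["a", "b"]}).to_numpy()

print(matrix.shape, matrix.dtype)
print(mixed.dtype)
```

An object array is a sign that non-numeric columns still need to be dropped or encoded before training.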

# Plotting
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html

In [None]:
#!pip install matplotlib

In [None]:
#import matplotlib.pyplot as plt

#set font and plot size to be larger if you want
#plt.rcParams.update({'font.size': 20, 'figure.figsize': (10, 8)}) 

### Scatter Plot

In [None]:
df.plot(kind='scatter', x='Age', y='Fare', title='Age vs Fare');

### Histogram

In [None]:
df['Age'].plot(kind='hist', title='Age');

### Boxplot

In [None]:
df['Age'].plot(kind="box");

In [None]:
df.boxplot(column='Age', by='Sex');

## Converting dataset back to a CSV

In [None]:
df.to_csv('clean_data.csv')
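
By default `to_csv` also writes the row index as an unnamed first column, which reappears as an extra column when the file is read back. Passing `index=False` omits it. A minimal round-trip sketch on a hypothetical frame, written to a string instead of a file:

```python
import io
import pandas as pd

# Hypothetical frame standing in for the cleaned data
toy = pd.DataFrame({"Age": [22, 38], "Fare": [7.25, 71.28]})

with_index = toy.to_csv()                # no path given: returns the CSV text
without_index = toy.to_csv(index=False)  # same, without the index column

# Reading the index-free text back reproduces the original frame
restored = pd.read_csv(io.StringIO(without_index))

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```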