**Data preparation** (also referred to as “data preprocessing”) is the process of transforming raw data so that data scientists and analysts can run it through machine learning algorithms to uncover insights or make predictions. 

Data Transformation
https://medium.com/@neevarp.v/data-preparation-for-machine-learning-data-cleaning-data-transformation-data-reduction-c4c86c4471a1
Machine learning models work with two basic types of data; all other forms, such as text, images, and video, must first be converted to one of them.  
Numeric (continuous): predicted using regression models. Many ML algorithms perform poorly when numeric features have very different scales.  
Categorical: predicted using classification models. Categorical data has to be converted to numeric form before applying most ML models.
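
As a minimal sketch of the conversion mentioned above, a categorical column can be turned into numeric indicator columns with `pd.get_dummies`. The tiny frame and its column names are invented for illustration and are not part of the Titanic data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file
toy = pd.DataFrame({
    'color': ['red', 'green', 'red'],
    'size_cm': [10.0, 12.5, 9.0],
})

# One-hot encode the categorical column; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=['color'], dtype=int)
print(encoded)
```

Each category becomes its own 0/1 column, so the result contains only numbers and can be scaled like any other numeric data.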

## Task1

Prepare the "Titanic.csv" dataset for deep learning. Explain in the report what actions you took and why. 

## Task2

Choose and prepare your own dataset for deep learning. Do not choose a fully prepared dataset. Explain in the report what actions you took and why. 

# Introduction to Pandas

**pandas** is a Python package that provides fast, flexible, and expressive data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time series data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. [https://pandas.pydata.org/]  
https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html

In [None]:
#!pip install pandas

In [2]:
import pandas as pd

## Series and DataFrames

The two primary components of pandas are the Series and the DataFrame. A Series is essentially a single column, and a DataFrame is a two-dimensional table made up of a collection of Series. 

https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html

In [None]:
data_example = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}

print(data_example)
type(data_example)

{'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]}


dict

In [None]:
purchases = pd.DataFrame(data_example)
print(purchases)

   apples  oranges
0       3        0
1       2        3
2       0        7
3       1        2


In [None]:
type(purchases)

pandas.core.frame.DataFrame

In [None]:
type(purchases['apples'])

pandas.core.series.Series

In [None]:
type(purchases[['apples']])

pandas.core.frame.DataFrame

This laboratory work covers the main functions for working with a DataFrame.

# Most important DataFrame operations

## Import your data
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [4]:
#for Google Colab
#https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92
url = 'https://raw.githubusercontent.com/vytkuc/inf5007/main/Titanic.csv'
df = pd.read_csv(url)

In [None]:
#for local user
#df = pd.read_csv("Titanic.csv")

Example of other parameters:

In [None]:
#df = pd.read_csv("titanic.csv", index_col=2)
#df = pd.read_csv("titanic.csv", index_col="Name")
#df = pd.read_csv("titanic.csv", delimiter=";")
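
The commented parameters above can be tried without a file on disk. The following sketch reads a small, made-up semicolon-separated text through `io.StringIO`; the names and values are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated text standing in for a file on disk
raw = "Name;Age;Fare\nAnna;22;7.25\nBen;38;71.28\n"

# sep (alias: delimiter) sets the field separator;
# index_col picks the column used as row labels
people = pd.read_csv(io.StringIO(raw), sep=';', index_col='Name')
print(people)
print(people.loc['Anna', 'Age'])
```

With `index_col='Name'`, rows can be looked up by passenger name instead of by position.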

## Viewing your data

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html
  
**.head()** outputs the first five rows of your DataFrame by default. .head(10) would output the top ten rows.

In [None]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


To see the last five rows use .tail(). tail() also accepts a number.

In [None]:
df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


The **display** function provides a cleaner display than merely printing the data frame.  Specifying the maximum rows and columns allows you to achieve greater control over the display.

In [3]:
pd.options.display.max_rows = 8
pd.options.display.max_columns = 8
display(df)

NameError: ignored

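The `NameError` above most likely occurs because the cell ran before `df` was loaded. After loading the data, set the display options and run the cells below: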

In [None]:
pd.options.display.max_columns = None
pd.options.display.max_rows = 10


In [None]:
display(df)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


These settings stay in effect for every later call to display() until you change them by running a cell with different values.  
https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html
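
If the display settings should apply to one output only, `pd.option_context` is one way to change them temporarily. This is a small sketch on a made-up 100-row frame, not part of the original lab.

```python
import pandas as pd

wide = pd.DataFrame({'a': range(100)})  # hypothetical frame with 100 rows

before = pd.get_option('display.max_rows')
# Inside the block at most 6 rows are shown; the old value is restored afterwards
with pd.option_context('display.max_rows', 6):
    inside = pd.get_option('display.max_rows')
    print(wide)
after = pd.get_option('display.max_rows')
print(before, inside, after)
```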

## Getting info about your data

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html

In [None]:
df.shape

(891, 12)

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.value_counts.html

In [None]:
df['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64

## Rename columns
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html

In [None]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [None]:
df.rename(columns={
        'Pclass': 'P_class', 
    }, inplace=True)

In [None]:
df.head()

Unnamed: 0,PassengerId,Survived,P_class,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Descriptive statistics
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

Descriptive statistics summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values. **.describe()** works on both numeric and object Series as well as on DataFrames with mixed column types; for a mixed DataFrame it reports only the numeric columns by default.

In [None]:
df.describe()

Unnamed: 0,PassengerId,Survived,P_class,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [None]:
df['Age'].describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64
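
To summarize non-numeric columns as well, `describe` accepts an `include` argument. The sketch below uses a small invented frame rather than the Titanic data.

```python
import pandas as pd

# Hypothetical toy data with one text column and one numeric column
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, None, 35.0],
})

# Default: numeric columns only
numeric_summary = toy.describe()
# include='all': text columns additionally get count / unique / top / freq
full_summary = toy.describe(include='all')
print(numeric_summary)
print(full_summary)
```

Statistics that do not apply to a column (for example `mean` for `Sex`) appear as NaN in the combined table.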

## Correlation
Correlation coefficients quantify the association between variables or features of a dataset.

Use .corr() to calculate any of three correlation coefficients. The **method** parameter selects the statistic: **pearson** (the default), **spearman**, or **kendall**.  

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html

In [None]:
x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# Pearson's r
print("Pearson's r")
print(x.corr(y))                     
print(y.corr(x))

# Spearman's rho
print("Spearman's rho")
print(x.corr(y, method='spearman'))  

# Kendall's tau
print("Kendall's tau")
print(x.corr(y, method='kendall'))   

Pearson's r
0.7586402890911867
0.7586402890911869
Spearman's rho
0.9757575757575757
Kendall's tau
0.911111111111111


In [None]:
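# numeric_only=True skips text columns such as Name (pandas 2.0+ raises an error without it)
corr_matrix = df.corr(numeric_only=True)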
display(corr_matrix)

Unnamed: 0,PassengerId,Survived,P_class,Age,SibSp,Parch,Fare
PassengerId,1.0,-0.005007,-0.035144,0.036847,-0.057527,-0.001652,0.012658
Survived,-0.005007,1.0,-0.338481,-0.077221,-0.035322,0.081629,0.257307
P_class,-0.035144,-0.338481,1.0,-0.369226,0.083081,0.018443,-0.5495
Age,0.036847,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067
SibSp,-0.057527,-0.035322,0.083081,-0.308247,1.0,0.414838,0.159651
Parch,-0.001652,0.081629,0.018443,-0.189119,0.414838,1.0,0.216225
Fare,0.012658,0.257307,-0.5495,0.096067,0.159651,0.216225,1.0


This example shows two ways of accessing values:

Use **.at[]** to access a single value by row and column labels.  
Use **.iat[]** to access a value by the positions of its row and column.  

In [None]:
corr_matrix.at['Survived', 'Age']

-0.07722109457217756

In [None]:
corr_matrix.iat[2, 4]

0.08308136284568686

## How to work with missing values

Ideally, every row of data will have values for all columns. However, this is rarely the case. Missing values are a reality of machine learning.  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html

In [None]:
df.isnull()

Unnamed: 0,PassengerId,Survived,P_class,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


**df.isnull().sum()** returns the number of missing values in each column (**isna()** is an equivalent alias).

In [None]:
df.isnull().sum()

PassengerId      0
Survived         0
P_class          0
Name             0
Sex              0
              ... 
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
Length: 12, dtype: int64
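
Alongside the counts, the share of missing values per column often helps when deciding between dropping and imputing. A minimal sketch on invented data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

missing_count = toy.isna().sum()
missing_share = toy.isna().mean()  # mean of booleans = fraction of True values
print(pd.DataFrame({'count': missing_count, 'share': missing_share}))
```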

## Possible options
1. Removing null values
2. Imputation

You have to choose the best option for your case.

### Removing null values

Drop **rows** with null values:  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

In [None]:
df_test1 = df.copy()
df_test1= df_test1.dropna()

In [None]:
df_test1 = df_test1.reset_index()

In [None]:
display(df_test1)

Unnamed: 0,index,PassengerId,Survived,P_class,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
1,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
2,6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
3,10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
4,11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...,...
178,871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
179,872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
180,879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
181,887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


The **axis** parameter determines whether rows or columns that contain missing values are removed.  

**0**, or **index** : Drop rows which contain missing values.  

**1**, or **columns** : Drop columns which contain missing values.  

In [None]:
#df.dropna(axis=1)
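
Besides `axis`, `dropna` also accepts `subset` and `thresh`, which give finer control over what is removed. The following is a sketch on a small invented frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Cabin': [np.nan, np.nan, 'C85'],
    'Fare': [7.25, 71.28, 7.92],
})

rows_all = toy.dropna()                      # keep rows with no missing values at all
rows_age = toy.dropna(subset=['Age'])        # only require Age to be present
cols_full = toy.dropna(axis=1)               # keep columns with no missing values
cols_thresh = toy.dropna(axis=1, thresh=2)   # keep columns with at least 2 non-missing values
print(len(rows_all), len(rows_age), list(cols_full.columns), list(cols_thresh.columns))
```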

### Imputation
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

#### Replace missing values with a scalar

In [None]:
df_test2 = df.copy()
df_test2['Age'] = df_test2['Age'].fillna(25)

#### Replacing With Mean/Median/Mode

A common practice is to replace missing values with the mean, median, or mode of that column:

In [None]:
df_test3 = df.copy()
median = df_test3['Fare'].median()
df_test3['Fare'] = df_test3['Fare'].fillna(median)
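
According to the `df.info()` output earlier, `Fare` has no missing values in this dataset, while `Age` and `Embarked` do. The sketch below shows the same idea on a small invented frame, using the median for a numeric column and the mode for a categorical one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 40.0],
    'Embarked': ['S', 'C', None, 'S'],
})

# Numeric column: fill with the median of the observed values
toy['Age'] = toy['Age'].fillna(toy['Age'].median())
# Categorical column: fill with the most frequent value (mode() returns a Series)
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])
print(toy)
```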

#### Back-fill or Forward-fill

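Missing values can also be replaced with the value before or after them, using **.ffill()** or **.bfill()** (older pandas versions used `fillna(method=...)`, which is now deprecated).

In [None]:
df_test4 = df.copy()
df_test4 = df_test4.ffill()
df_test4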

Unnamed: 0,PassengerId,Survived,P_class,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,C85,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,C123,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,C50,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,19.0,1,2,W./C. 6607,23.4500,B42,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


**ffill** stands for “forward fill” and replaces missing values with the value from the previous row. **bfill** stands for “backward fill” and uses the value from the next row.
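
The difference between the two directions is easiest to see on a short Series. This sketch uses invented values; note that a gap at the very start (forward fill) or very end (backward fill) has no neighbour to copy from and stays missing.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()   # leading NaN stays: nothing precedes it
backward = s.bfill()  # trailing NaN stays: nothing follows it
print(pd.DataFrame({'original': s, 'ffill': forward, 'bfill': backward}))
```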

## Standardization

In [None]:
# Keep only numeric columns
df_2 = df.select_dtypes(include='number')

In [None]:
display(df_2)

Unnamed: 0,PassengerId,Survived,P_class,Age,SibSp,Parch,Fare
0,1,0,3,22.0,1,0,7.2500
1,2,1,1,38.0,1,0,71.2833
2,3,1,3,26.0,0,0,7.9250
3,4,1,1,35.0,1,0,53.1000
4,5,0,3,35.0,0,0,8.0500
...,...,...,...,...,...,...,...
886,887,0,2,27.0,0,0,13.0000
887,888,1,1,19.0,0,0,30.0000
888,889,0,3,,1,2,23.4500
889,890,1,1,26.0,0,0,30.0000


In [None]:
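# standardize (z-score): zero mean, unit standard deviation per column
standardized_df = (df_2 - df_2.mean()) / df_2.std()

# normalize (min-max): rescale each column to the range [0, 1]
normalized_df = (df_2 - df_2.min()) / (df_2.max() - df_2.min())

In [None]:
display(standardized_df)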

Unnamed: 0,PassengerId,Survived,P_class,Age,SibSp,Parch,Fare
0,-1.729137,-0.788829,0.826913,-0.530005,0.432550,-0.473408,-0.502163
1,-1.725251,1.266279,-1.565228,0.571430,0.432550,-0.473408,0.786404
2,-1.721365,1.266279,0.826913,-0.254646,-0.474279,-0.473408,-0.488580
3,-1.717480,1.266279,-1.565228,0.364911,0.432550,-0.473408,0.420494
4,-1.713594,-0.788829,0.826913,0.364911,-0.474279,-0.473408,-0.486064
...,...,...,...,...,...,...,...
886,1.713594,-0.788829,-0.369158,-0.185807,-0.474279,-0.473408,-0.386454
887,1.717480,1.266279,-1.565228,-0.736524,-0.474279,-0.473408,-0.044356
888,1.721365,-0.788829,0.826913,,0.432550,2.007806,-0.176164
889,1.725251,1.266279,-1.565228,-0.254646,-0.474279,-0.473408,-0.044356


In [None]:
display(normalized_df)

Unnamed: 0,PassengerId,Survived,P_class,Age,SibSp,Parch,Fare
0,0.000000,0.0,1.0,0.271174,0.125,0.000000,0.014151
1,0.001124,1.0,0.0,0.472229,0.125,0.000000,0.139136
2,0.002247,1.0,1.0,0.321438,0.000,0.000000,0.015469
3,0.003371,1.0,0.0,0.434531,0.125,0.000000,0.103644
4,0.004494,0.0,1.0,0.434531,0.000,0.000000,0.015713
...,...,...,...,...,...,...,...
886,0.995506,0.0,0.5,0.334004,0.000,0.000000,0.025374
887,0.996629,1.0,0.0,0.233476,0.000,0.000000,0.058556
888,0.997753,0.0,1.0,,0.125,0.333333,0.045771
889,0.998876,1.0,0.0,0.321438,0.000,0.000000,0.058556


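or, using scikit-learn:

In [None]:
#https://scikit-learn.org/stable/modules/preprocessing.html

#from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

# Scale only numeric feature columns; text columns such as Name raise a ValueError
num_cols = ['Age', 'SibSp', 'Parch', 'Fare']
df_scaled = df.copy()  # keep df unchanged for the following sections
df_scaled[num_cols] = StandardScaler().fit_transform(df_scaled[num_cols])
df_scaled.head()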
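
The two manual formulas from this section can be checked on a tiny, made-up frame where the results are easy to verify by hand. This is only an illustrative sketch.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'Age': [20.0, 30.0, 40.0], 'Fare': [10.0, 10.0, 40.0]})

# z-score: subtract the column mean, divide by the sample standard deviation (ddof=1)
standardized = (toy - toy.mean()) / toy.std()
# min-max: rescale each column to the range [0, 1]
normalized = (toy - toy.min()) / (toy.max() - toy.min())
print(standardized)
print(normalized)
```

pandas `.std()` uses the sample standard deviation (ddof=1), whereas scikit-learn's `StandardScaler` divides by the population standard deviation (ddof=0), so the two approaches give slightly different numbers on small datasets.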

## DataFrame slicing, selecting, extracting

In [None]:
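# columns: positions 0-29 (a slice past the last column is simply truncated)
df_cols = df.iloc[:, 0:30]
# rows: the first 55 rows; stored in new variables so df keeps all of its data
df_rows = df.iloc[0:55, :]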

### By column

In [None]:
#df.loc[:, 'C':'E']
#is equivalent of
#df[['C', 'D', 'E']] or df.loc[:, ['C', 'D', 'E']]

In [None]:
subset_1 = df[['Fare','Survived']]
display(subset_1)

In [None]:
# Remember that .iloc slicing excludes the end position (unlike .loc, which includes the end label).
subset_2 = df.iloc[:, 0:4]
display(subset_2)

### By row

**.loc** - locates by name  
**.iloc** - locates by numerical index  

In [None]:
#df.loc["Age"] looks for a row labelled "Age"; with the default integer index use e.g. df.loc[0]

In [None]:
subset_3 = df.iloc[1:5]

In [None]:
display(subset_3)
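
The difference between the two accessors shows up most clearly when the index holds labels instead of the default integers. The sketch below uses invented row labels.

```python
import pandas as pd

toy = pd.DataFrame(
    {'Age': [22, 38, 26, 35]},
    index=['Braund', 'Cumings', 'Heikkinen', 'Futrelle'],  # hypothetical labels
)

by_label = toy.loc['Cumings':'Heikkinen']  # label slice: end label included
by_position = toy.iloc[1:2]                # position slice: end position excluded
print(by_label)
print(by_position)
```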

## Concatenating Rows and Columns
pandas can concatenate rows and columns to form new data frames with **pd.concat**: axis=1 places the inputs side by side as columns, and axis=0 stacks them as rows.

In [None]:
age = df['Age']
fare = df['Fare']
result = pd.concat([age, fare], axis=1)
display(result)

In [None]:
# Create a new dataframe from first 2 rows and last 2 rows
result = pd.concat([df[0:2],df[-2:]], axis=0)
display(result)

## Dropping Fields

Fields that are of no value to the neural network should be dropped.  The following code removes the selected column from the dataset.  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html

In [None]:
df_example_for_drop = df.copy()

print(f"Before drop: {list(df_example_for_drop.columns)}")

# Passing the axis positionally was removed in pandas 2.0; name the columns explicitly
df_example_for_drop.drop(columns='Name', inplace=True)

print(f"After drop: {list(df_example_for_drop.columns)}")

display(df_example_for_drop)

## Replace values

In [None]:
# Map both categories in one step; assigning back avoids chained in-place replacement
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})
display(df)

## Training and Validation

### From scratch

In [None]:
import numpy as np
# Usually a good idea to shuffle
df = df.reindex(np.random.permutation(df.index)) 

mask = np.random.rand(len(df)) < 0.8
trainDF = pd.DataFrame(df[mask])
validationDF = pd.DataFrame(df[~mask])

print(f"Training DF: {len(trainDF)}")
print(f"Validation DF: {len(validationDF)}")
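
The random mask above yields roughly, not exactly, 80% of the rows. If an exact and repeatable split is wanted, one alternative is `DataFrame.sample` with a fixed `random_state`, sketched here on invented data.

```python
import pandas as pd

# Hypothetical toy data with 100 rows
toy = pd.DataFrame({'x': range(100), 'y': [i % 2 for i in range(100)]})

# Draw exactly 80% of the rows at random; random_state makes the split repeatable
train_df = toy.sample(frac=0.8, random_state=42)
# Everything that was not drawn goes to validation
validation_df = toy.drop(train_df.index)

print(f"Training DF: {len(train_df)}")
print(f"Validation DF: {len(validation_df)}")
```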

### Using libraries

#### Split into inputs and outputs

In [None]:
X = df.drop(columns=['Survived'])  #independent columns
y = df['Survived']                 #target column
print(X.shape, y.shape)

#### Split into train test sets
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

## Converting a Dataframe to a Matrix
Neural networks do not operate directly on pandas data frames; they require a numeric matrix.  The **values** property of a data frame converts the data to a NumPy array (the pandas documentation recommends the equivalent **.to_numpy()** method).

In [None]:
df.values

Convert some of the columns:

In [None]:
df[['Age', 'Survived']].values

In [None]:
X = X.values
y = y.values
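
When the target framework expects a specific numeric type, `.to_numpy()` can set it during conversion. The sketch below assumes `float32`, a common choice for neural networks, and uses invented values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one float and one integer column
toy = pd.DataFrame({'Age': [22.0, 38.0], 'Survived': [0, 1]})

# All selected columns are converted to a single dtype in the resulting matrix
matrix = toy[['Age', 'Survived']].to_numpy(dtype=np.float32)
print(matrix.shape, matrix.dtype)
```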

# Plotting
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html

In [None]:
#!pip install matplotlib

In [None]:
#import matplotlib.pyplot as plt

#set font and plot size to be larger if you want
#plt.rcParams.update({'font.size': 20, 'figure.figsize': (10, 8)}) 

### Scatter Plot

In [None]:
df.plot(kind='scatter', x='Age', y='Fare', title='Age vs Fare');

### Histogram

In [None]:
df['Age'].plot(kind='hist', title='Age');

### Boxplot

In [None]:
df['Age'].plot(kind="box");

In [None]:
df.boxplot(column='Age', by='Sex');

## Converting dataset back to a CSV

df.to_csv('1_task.csv')
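
By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below shows the effect of `index=False` on a small invented frame, writing to in-memory buffers instead of real files.

```python
import io
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'Age': [22.0, 38.0], 'Fare': [7.25, 71.28]})

with_index = io.StringIO()
toy.to_csv(with_index)                   # default: index written as an unnamed first column
without_index = io.StringIO()
toy.to_csv(without_index, index=False)   # only the data columns are written

reloaded_a = pd.read_csv(io.StringIO(with_index.getvalue()))
reloaded_b = pd.read_csv(io.StringIO(without_index.getvalue()))
print(list(reloaded_a.columns))
print(list(reloaded_b.columns))
```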