# Chapter 3: A Short Introduction to AI

... Theory ...

## Datasets for machine learning (pandas library)

This section uses the well-known iris.csv dataset to introduce the pandas library.

In [1]:
import pandas as pd
import os

path = os.path.join(os.getcwd(), "data", "iris.csv")  # data/iris.csv under the current working directory
df = pd.read_csv(path)

Use `head()`, `tail()` and `describe()` to take a first look at the data.

In [2]:
print(df.head())        # first 5 rows
print(df.tail())        # last 5 rows
print(df.describe())    # summary statistics for the numeric columns

   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa
     sepal.length  sepal.width  petal.length  petal.width    variety
145           6.7          3.0           5.2          2.3  Virginica
146           6.3          2.5           5.0          1.9  Virginica
147           6.5          3.0           5.2          2.0  Virginica
148           6.2          3.4           5.4          2.3  Virginica
149           5.9          3.0           5.1          1.8  Virginica
       sepal.length  sepal.width  petal.length  petal.width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066   

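The attributes `dtypes`, `index`, `columns` and `values` expose the column types, the row labels, the column names and the underlying data.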

In [3]:
print(df.dtypes)
print(df.index)
print(df.columns)
print(df.values)

sepal.length    float64
sepal.width     float64
petal.length    float64
petal.width     float64
variety          object
dtype: object
RangeIndex(start=0, stop=150, step=1)
Index(['sepal.length', 'sepal.width', 'petal.length', 'petal.width',
       'variety'],
      dtype='object')
[[5.1 3.5 1.4 0.2 'Setosa']
 [4.9 3.0 1.4 0.2 'Setosa']
 [4.7 3.2 1.3 0.2 'Setosa']
 [4.6 3.1 1.5 0.2 'Setosa']
 [5.0 3.6 1.4 0.2 'Setosa']
 [5.4 3.9 1.7 0.4 'Setosa']
 [4.6 3.4 1.4 0.3 'Setosa']
 [5.0 3.4 1.5 0.2 'Setosa']
 [4.4 2.9 1.4 0.2 'Setosa']
 [4.9 3.1 1.5 0.1 'Setosa']
 [5.4 3.7 1.5 0.2 'Setosa']
 [4.8 3.4 1.6 0.2 'Setosa']
 [4.8 3.0 1.4 0.1 'Setosa']
 [4.3 3.0 1.1 0.1 'Setosa']
 [5.8 4.0 1.2 0.2 'Setosa']
 [5.7 4.4 1.5 0.4 'Setosa']
 [5.4 3.9 1.3 0.4 'Setosa']
 [5.1 3.5 1.4 0.3 'Setosa']
 [5.7 3.8 1.7 0.3 'Setosa']
 [5.1 3.8 1.5 0.3 'Setosa']
 [5.4 3.4 1.7 0.2 'Setosa']
 [5.1 3.7 1.5 0.4 'Setosa']
 [4.6 3.6 1.0 0.2 'Setosa']
 [5.1 3.3 1.7 0.5 'Setosa']
 [4.8 3.4 1.9 0.2 'Setosa']
 [5.0 3.0 1.6 0.2 
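
Because `variety` holds strings, the array printed above has dtype `object`. The sketch below, built on a hypothetical two-row sample rather than the real file, suggests how selecting only the numeric columns yields a `float64` array instead. It uses `to_numpy()`, which the pandas documentation recommends over `.values`.

```python
import pandas as pd

# Hypothetical two-row sample with some of the iris.csv column names
df = pd.DataFrame({
    "sepal.length": [5.1, 4.9],
    "sepal.width": [3.5, 3.0],
    "variety": ["Setosa", "Setosa"],
})

mixed = df.to_numpy()                                      # floats and strings together
numeric = df[["sepal.length", "sepal.width"]].to_numpy()   # numeric columns only

print(mixed.dtype)    # object
print(numeric.dtype)  # float64
```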

Data can also be sorted with `sort_values()`.

In [4]:
df2 = df.sort_values('sepal.width', ascending=False)  # returns a sorted copy; df itself is unchanged
print(df2)

     sepal.length  sepal.width  petal.length  petal.width     variety
15            5.7          4.4           1.5          0.4      Setosa
33            5.5          4.2           1.4          0.2      Setosa
32            5.2          4.1           1.5          0.1      Setosa
14            5.8          4.0           1.2          0.2      Setosa
16            5.4          3.9           1.3          0.4      Setosa
..            ...          ...           ...          ...         ...
87            6.3          2.3           4.4          1.3  Versicolor
62            6.0          2.2           4.0          1.0  Versicolor
68            6.2          2.2           4.5          1.5  Versicolor
119           6.0          2.2           5.0          1.5   Virginica
60            5.0          2.0           3.5          1.0  Versicolor

[150 rows x 5 columns]
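
Note that the sorted frame keeps the original row labels (15, 33, 32, ...). A minimal sketch on a hypothetical four-row sample shows sorting on two columns at once and, if a fresh 0..n-1 index is wanted, renumbering the result with `reset_index(drop=True)`:

```python
import pandas as pd

# Hypothetical four-row sample; the values are illustrative only
df = pd.DataFrame({
    "sepal.length": [5.1, 4.9, 6.0, 6.2],
    "sepal.width": [3.5, 2.2, 2.2, 3.5],
})

# Widest sepals first; ties are broken by ascending sepal length
df2 = df.sort_values(["sepal.width", "sepal.length"], ascending=[False, True])
print(list(df2.index))  # [0, 3, 1, 2] : original labels are preserved

# Renumber the rows after sorting
df3 = df2.reset_index(drop=True)
print(list(df3.index))  # [0, 1, 2, 3]
```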


Selecting columns of a data frame by name

In [5]:
print(df[['sepal.width']])  # select one column by name (double brackets return a DataFrame)
print(df[['sepal.width', 'sepal.length']])  # select two columns by name

     sepal.width
0            3.5
1            3.0
2            3.2
3            3.1
4            3.6
..           ...
145          3.0
146          2.5
147          3.0
148          3.4
149          3.0

[150 rows x 1 columns]
     sepal.width  sepal.length
0            3.5           5.1
1            3.0           4.9
2            3.2           4.7
3            3.1           4.6
4            3.6           5.0
..           ...           ...
145          3.0           6.7
146          2.5           6.3
147          3.0           6.5
148          3.4           6.2
149          3.0           5.9

[150 rows x 2 columns]


Slicing rows by position

In [6]:
print(df[2:4])  # slice rows by position; the end position (4) is excluded

   sepal.length  sepal.width  petal.length  petal.width variety
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa


Slicing rows and columns at the same time uses the indexers `loc[]` (label-based, end label included) or `iloc[]` (position-based, end position excluded).

In [7]:
print(df.loc[2:4, ['petal.width', 'petal.length']])  # rows by label (2 to 4 inclusive), columns by name
print(df.iloc[2:4, [0, 1]])  # rows and columns by position (row 4 excluded)

   petal.width  petal.length
2          0.2           1.3
3          0.2           1.5
4          0.2           1.4
   sepal.length  sepal.width
2           4.7          3.2
3           4.6          3.1


We can filter rows using logical conditions or the `isin()` method.

In [8]:
print(df[df['sepal.width'] > 3])  # keep rows where the condition is True
print(df[df['variety'].isin(["Setosa"])])  # keep rows whose variety is in the list

     sepal.length  sepal.width  petal.length  petal.width    variety
0             5.1          3.5           1.4          0.2     Setosa
2             4.7          3.2           1.3          0.2     Setosa
3             4.6          3.1           1.5          0.2     Setosa
4             5.0          3.6           1.4          0.2     Setosa
5             5.4          3.9           1.7          0.4     Setosa
..            ...          ...           ...          ...        ...
140           6.7          3.1           5.6          2.4  Virginica
141           6.9          3.1           5.1          2.3  Virginica
143           6.8          3.2           5.9          2.3  Virginica
144           6.7          3.3           5.7          2.5  Virginica
148           6.2          3.4           5.4          2.3  Virginica

[67 rows x 5 columns]
    sepal.length  sepal.width  petal.length  petal.width variety
0            5.1          3.5           1.4          0.2  Setosa
1            4.9   
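
Conditions can also be combined. A minimal sketch, assuming a hypothetical five-row sample: `&` means "and", `|` means "or" and `~` negates a condition. When conditions are written inline, each one needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical five-row sample; the values are illustrative only
df = pd.DataFrame({
    "sepal.width": [3.5, 3.0, 3.2, 2.5, 3.4],
    "variety": ["Setosa", "Setosa", "Versicolor", "Virginica", "Virginica"],
})

wide = df["sepal.width"] > 3                          # boolean Series
selected = df["variety"].isin(["Setosa", "Virginica"])

both = df[wide & selected]     # rows satisfying both conditions
either = df[wide | selected]   # rows satisfying at least one condition
others = df[~selected]         # rows whose variety is NOT in the list

# Inline form: parentheses around each condition are required
wide_virginica = df[(df["sepal.width"] > 3) & (df["variety"] == "Virginica")]

print(list(both.index))            # [0, 4]
print(list(others.index))          # [2]
print(list(wide_virginica.index))  # [4]
```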

We can create new columns, either computed from existing columns or filled with a constant.

In [9]:
df["sepal.area"] = df['sepal.length'] * df['sepal.width']
df['zeros'] = 0.0
print(df)

     sepal.length  sepal.width  petal.length  petal.width    variety  \
0             5.1          3.5           1.4          0.2     Setosa   
1             4.9          3.0           1.4          0.2     Setosa   
2             4.7          3.2           1.3          0.2     Setosa   
3             4.6          3.1           1.5          0.2     Setosa   
4             5.0          3.6           1.4          0.2     Setosa   
..            ...          ...           ...          ...        ...   
145           6.7          3.0           5.2          2.3  Virginica   
146           6.3          2.5           5.0          1.9  Virginica   
147           6.5          3.0           5.2          2.0  Virginica   
148           6.2          3.4           5.4          2.3  Virginica   
149           5.9          3.0           5.1          1.8  Virginica   

     sepal.area  zeros  
0         17.85    0.0  
1         14.70    0.0  
2         15.04    0.0  
3         14.26    0.0  
4         

We can also remove a column with `drop()`.

In [10]:
df = df.drop(['zeros'],axis=1)
print("df after drop",df)

df after drop      sepal.length  sepal.width  petal.length  petal.width    variety  \
0             5.1          3.5           1.4          0.2     Setosa   
1             4.9          3.0           1.4          0.2     Setosa   
2             4.7          3.2           1.3          0.2     Setosa   
3             4.6          3.1           1.5          0.2     Setosa   
4             5.0          3.6           1.4          0.2     Setosa   
..            ...          ...           ...          ...        ...   
145           6.7          3.0           5.2          2.3  Virginica   
146           6.3          2.5           5.0          1.9  Virginica   
147           6.5          3.0           5.2          2.0  Virginica   
148           6.2          3.4           5.4          2.3  Virginica   
149           5.9          3.0           5.1          1.8  Virginica   

     sepal.area  
0         17.85  
1         14.70  
2         15.04  
3         14.26  
4         18.00  
..          .
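
`drop()` returns a new frame rather than modifying the original, which is why the cell above reassigns the result to `df`. As a sketch on a hypothetical two-row sample, the `columns=` keyword is an equivalent and arguably more readable spelling of `axis=1`:

```python
import pandas as pd

# Hypothetical two-row sample; 'zeros' plays the role of the column to remove
df = pd.DataFrame({"sepal.length": [5.1, 4.9], "zeros": [0.0, 0.0]})

trimmed = df.drop(columns=["zeros"])  # same effect as df.drop(['zeros'], axis=1)

print(list(trimmed.columns))  # ['sepal.length']
print(list(df.columns))       # ['sepal.length', 'zeros'] : df is unchanged
```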

We can rename individual columns with `rename()`, or replace all column names at once by assigning to `df.columns`.

In [11]:
df.rename(columns = {'sepal.area':'sep.ar'},inplace=True)
print(df.head())
df.columns = ['col1','col2','col3','col4','col5','col6']
print(df.head())

   sepal.length  sepal.width  petal.length  petal.width variety  sep.ar
0           5.1          3.5           1.4          0.2  Setosa   17.85
1           4.9          3.0           1.4          0.2  Setosa   14.70
2           4.7          3.2           1.3          0.2  Setosa   15.04
3           4.6          3.1           1.5          0.2  Setosa   14.26
4           5.0          3.6           1.4          0.2  Setosa   18.00
   col1  col2  col3  col4    col5   col6
0   5.1   3.5   1.4   0.2  Setosa  17.85
1   4.9   3.0   1.4   0.2  Setosa  14.70
2   4.7   3.2   1.3   0.2  Setosa  15.04
3   4.6   3.1   1.5   0.2  Setosa  14.26
4   5.0   3.6   1.4   0.2  Setosa  18.00


To add a new row, first build a `Series` with `pd.Series()`, indexed by the DataFrame's column names, then assign it to `df.loc[len(df)]`, the first unused position of the default integer index. (Note that `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0.)

In [12]:
to_append = [7.0,4.0,5.5,6.6,"Setosa",28.0]
a_series = pd.Series(to_append, index = df.columns)

df.loc[len(df)] = a_series
print(df)

     col1  col2  col3  col4       col5   col6
0     5.1   3.5   1.4   0.2     Setosa  17.85
1     4.9   3.0   1.4   0.2     Setosa  14.70
2     4.7   3.2   1.3   0.2     Setosa  15.04
3     4.6   3.1   1.5   0.2     Setosa  14.26
4     5.0   3.6   1.4   0.2     Setosa  18.00
..    ...   ...   ...   ...        ...    ...
146   6.3   2.5   5.0   1.9  Virginica  15.75
147   6.5   3.0   5.2   2.0  Virginica  19.50
148   6.2   3.4   5.4   2.3  Virginica  21.08
149   5.9   3.0   5.1   1.8  Virginica  17.70
150   7.0   4.0   5.5   6.6     Setosa  28.00

[151 rows x 6 columns]
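
When several rows have to be added at once, `pd.concat()` is the replacement that the pandas deprecation notice for `append()` points to. A minimal sketch, assuming a hypothetical two-column frame that reuses the `col1` and `col5` names from above:

```python
import pandas as pd

# Hypothetical frame with only two of the renamed columns
df = pd.DataFrame({
    "col1": [5.1, 4.9],
    "col5": ["Setosa", "Setosa"],
})

# The rows to add, as a second frame with the same columns
new_rows = pd.DataFrame({
    "col1": [7.0, 6.3],
    "col5": ["Setosa", "Virginica"],
})

# ignore_index=True renumbers the result 0..n-1 instead of repeating labels 0 and 1
df = pd.concat([df, new_rows], ignore_index=True)

print(len(df))         # 4
print(list(df.index))  # [0, 1, 2, 3]
```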


We can loop over the data frame with `iterrows()`

In [13]:
for ind, row in df.iterrows():
    print("Line {} : {}".format(ind, row['col2']))
    if ind == 5: break  # stop after index 5, so the first 6 rows are displayed

Line 0 : 3.5
Line 1 : 3.0
Line 2 : 3.2
Line 3 : 3.1
Line 4 : 3.6
Line 5 : 3.9
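
`iterrows()` builds a `Series` for every row, which can be slow on large frames. As a sketch of one alternative, `itertuples()` yields lightweight named tuples and is generally faster; the three-row frame below is a hypothetical stand-in for the real data, and `head(2)` limits the loop without a manual `break`.

```python
import pandas as pd

# Hypothetical three-row sample reusing the renamed columns col1 and col2
df = pd.DataFrame({"col1": [5.1, 4.9, 4.7], "col2": [3.5, 3.0, 3.2]})

lines = []
for row in df.head(2).itertuples():
    # row.Index is the row label; columns are available as attributes
    lines.append("Line {} : {}".format(row.Index, row.col2))

for line in lines:
    print(line)
```

Attribute access such as `row.col2` only works when the column name is a valid Python identifier, so it would not work directly with the original dotted names like `sepal.width`.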


Finally, we can save the data frame back to a CSV file with `to_csv()`.

In [14]:
df.to_csv("iris_new.csv")
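
By default `to_csv()` also writes the row index as an unnamed first column, which reappears as an extra column when the file is read back. A minimal sketch on a hypothetical two-row frame, writing to a string instead of a file so that nothing is created on disk:

```python
import io
import pandas as pd

# Hypothetical two-row sample reusing the renamed columns col1 and col5
df = pd.DataFrame({"col1": [5.1, 4.9], "col5": ["Setosa", "Virginica"]})

with_index = df.to_csv()                # no path given: the CSV text is returned
without_index = df.to_csv(index=False)  # leave the row index out

print(with_index.splitlines()[0])     # ,col1,col5
print(without_index.splitlines()[0])  # col1,col5

# Reading the index-free version back gives the original columns
restored = pd.read_csv(io.StringIO(without_index))
print(list(restored.columns))         # ['col1', 'col5']
```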