# Data Munging
The term mung was coined about half a century ago by students at the
Massachusetts Institute of Technology (MIT). Munging means changing a piece of original data,
in a series of well-specified and reversible steps, into a completely different (and
hopefully more useful) one.

## Data science process
The data science process can be described using the acronym OSEMN (Obtain, Scrub, Explore,
Model, iNterpret).

<img src= 'OSEMN.png' style= 'width:700px;height:150px'/>

## Data loading and preprocessing with pandas


### Fast and easy data loading: read_csv()

#### Loading data from your computer

In [1]:
import pandas as pd
iris_dataname = "iris/iris.data"
iris_df = pd.read_csv(iris_dataname, sep=',', decimal='.', header=None,
names = ['sepal_length','sepal_width','petal_length','petal_width','target']) # iris_df here is a pandas dataframe
iris_df.head()  # to print the first 5 rows of the dataframe

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


#### Loading data from the internet

In [2]:
#import urllib.request
#url = 
#set1 = urllib.request.Request(url)
#iris_p = urllib.request.urlopen(set1)
#iris_other = pd.read_csv(iris_p, sep=',', decimal='.',
#header=None, names= ['sepal_length', 'sepal_width',
#'petal_length', 'petal_width',
#'target'])
#iris_other.head()

#### Print the first rows

By default, head() prints only the first five rows of the DataFrame. You can pass the number of rows you want to see as an argument to head().

In [3]:
iris_df.head() # default case

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [4]:
iris_df.head(4) # first 4 rows

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa


#### Get the last rows 

In [5]:
iris_df.tail() # default case: 5 last rows

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
149,5.9,3.0,5.1,1.8,Iris-virginica


In [6]:
iris_df.tail(10) # last 10 rows

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
140,6.7,3.1,5.6,2.4,Iris-virginica
141,6.9,3.1,5.1,2.3,Iris-virginica
142,5.8,2.7,5.1,1.9,Iris-virginica
143,6.8,3.2,5.9,2.3,Iris-virginica
144,6.7,3.3,5.7,2.5,Iris-virginica
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
149,5.9,3.0,5.1,1.8,Iris-virginica


#### Get all column names of the DataFrame

In [7]:
iris_column_names = iris_df.columns
print(iris_column_names)
print(type(iris_column_names)) # the result is a pandas Index, not a list

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'target'], dtype='object')
<class 'pandas.core.indexes.base.Index'>
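When a plain Python list is more convenient than a pandas Index, the conversion is short. The sketch below uses a small hypothetical frame (`demo_df` is not part of the notebook) rather than the iris file:

```python
import pandas as pd

# Hypothetical two-column frame standing in for iris_df
demo_df = pd.DataFrame({"sepal_length": [5.1, 4.9], "target": ["a", "b"]})

column_index = demo_df.columns          # pandas Index
column_list = demo_df.columns.tolist()  # plain Python list

print(type(column_index).__name__, column_list)
```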


#### Extract a column by name as a Series

In [8]:
y = iris_df['target'] # access a column as a pandas Series
print(type(y))    # y is a pandas Series
y

<class 'pandas.core.series.Series'>


0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: target, Length: 150, dtype: object

In [9]:
z = iris_df.target
z

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: target, Length: 150, dtype: object

#### Extract columns by name as a DataFrame

In [10]:
X = iris_df[['sepal_length','petal_length']]
X

Unnamed: 0,sepal_length,petal_length
0,5.1,1.4
1,4.9,1.4
2,4.7,1.3
3,4.6,1.5
4,5.0,1.4
...,...,...
145,6.7,5.2
146,6.3,5.0
147,6.5,5.2
148,6.2,5.4


In [11]:
print(type(X)) # X is a pandas DataFrame

<class 'pandas.core.frame.DataFrame'>


#### Get the dimensions of a DataFrame or Series

In [12]:
print(iris_df.shape)

(150, 5)


In [13]:
print(X.shape)

(150, 2)


In [14]:
print(y.shape)

(150,)
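The shapes above follow from how the column was selected: single brackets give a one-dimensional Series, while a list of names gives a two-dimensional DataFrame, even for a single column. A minimal sketch on a hypothetical frame (`demo_df` is illustrative only):

```python
import pandas as pd

demo_df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

as_series = demo_df["a"]      # single brackets -> Series
as_frame = demo_df[["a"]]     # list of names   -> DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```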


In [15]:
iris_df.sepal_width

0      3.5
1      3.0
2      3.2
3      3.1
4      3.6
      ... 
145    3.0
146    2.5
147    3.0
148    3.4
149    3.0
Name: sepal_width, Length: 150, dtype: float64

## Dealing with problematic data

In [16]:
import pandas as pd
fake_dataset = pd.read_csv("CSV_Files.csv")
fake_dataset

Unnamed: 0,Date,Temperature_city_1,Temperature_city_2,Temperature_city_3,Which_destination
0,20140910,80.0,32.0,40,1
1,20140911,100.0,50.0,36,2
2,20140912,102.0,55.0,46,1
3,20140912,60.0,20.0,35,3
4,20140914,60.0,,32,3
5,20140914,,57.0,42,2


#### Parse the dates correctly

In [17]:
fake_dataset = pd.read_csv("CSV_Files.csv", parse_dates = [0]) # parse the first column as dates
fake_dataset

Unnamed: 0,Date,Temperature_city_1,Temperature_city_2,Temperature_city_3,Which_destination
0,2014-09-10,80.0,32.0,40,1
1,2014-09-11,100.0,50.0,36,2
2,2014-09-12,102.0,55.0,46,1
3,2014-09-12,60.0,20.0,35,3
4,2014-09-14,60.0,,32,3
5,2014-09-14,,57.0,42,2


#### Replace missing values with a meaningful value

In [18]:
fake_dataset.fillna(50)

Unnamed: 0,Date,Temperature_city_1,Temperature_city_2,Temperature_city_3,Which_destination
0,2014-09-10,80.0,32.0,40,1
1,2014-09-11,100.0,50.0,36,2
2,2014-09-12,102.0,55.0,46,1
3,2014-09-12,60.0,20.0,35,3
4,2014-09-14,60.0,50.0,32,3
5,2014-09-14,50.0,57.0,42,2


**NB**: fillna() returns a new DataFrame with the missing values filled; it does not
modify the original DataFrame. To keep the result, assign it back to a variable
or pass the inplace=True argument.
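A minimal sketch of this behaviour, on a small hypothetical frame (`demo_df` and its column names are illustrative, not the notebook's CSV file):

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame({"t1": [80.0, np.nan], "t2": [np.nan, 57.0]})

filled = demo_df.fillna(50)                  # new DataFrame; demo_df is untouched
missing_before = int(demo_df.isna().sum().sum())
missing_in_copy = int(filled.isna().sum().sum())
print(missing_before, missing_in_copy)

demo_df = demo_df.fillna(50)                 # reassigning keeps the result
```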

#### Replace the missing values with the mean of each column

In [19]:
avg_value = fake_dataset.mean(axis = 0, numeric_only = True) # column-wise mean of the numeric columns
median_value = fake_dataset.median(numeric_only = True) # column-wise median, an alternative fill value
fake_dataset.fillna(avg_value)


Unnamed: 0,Date,Temperature_city_1,Temperature_city_2,Temperature_city_3,Which_destination
0,2014-09-10,80.0,32.0,40,1
1,2014-09-11,100.0,50.0,36,2
2,2014-09-12,102.0,55.0,46,1
3,2014-09-12,60.0,20.0,35,3
4,2014-09-14,60.0,42.8,32,3
5,2014-09-14,80.4,57.0,42,2
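Passing a Series to fillna() fills each column with the entry that carries the same label, which is why every gap above received its own column's mean. The sketch below repeats the idea on a hypothetical frame and also shows the median as an alternative; the names are assumptions for illustration:

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame({
    "city_1": [80.0, 100.0, np.nan],
    "city_2": [32.0, np.nan, 55.0],
})

col_means = demo_df.mean(numeric_only=True)      # one mean per column
col_medians = demo_df.median(numeric_only=True)  # one median per column

filled_mean = demo_df.fillna(col_means)      # each NaN gets its own column's mean
filled_median = demo_df.fillna(col_medians)  # same idea with the median
print(filled_mean)
```

The median is often preferred when a column contains outliers, since a single extreme value shifts the mean much more than the median.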


#### Dealing with a badly formatted dataset

Let's load the following file (which contains a bad line) with read_csv using its default settings.

In [20]:
bad_line_df = pd.read_csv('bad_format.csv')

ParserError: Error tokenizing data. C error: Expected 3 fields in line 4, saw 4


This raises an error. To avoid it, we can skip the lines that cause the exception by setting the on_bad_lines parameter to 'skip' (earlier versions of pandas used the error_bad_lines and warn_bad_lines parameters instead).

In [21]:
bad_line_df = pd.read_csv('bad_format.csv', on_bad_lines = 'skip')
bad_line_df

Unnamed: 0,Val1,Val2,Val3
0,0,0,0
1,1,1,1
2,3,3,3
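The same behaviour can be reproduced without a file on disk. The sketch below assumes pandas 1.3 or later (where on_bad_lines is available) and uses a hypothetical in-memory CSV whose third data row has one field too many:

```python
import io
import pandas as pd

# Hypothetical content: the row "2,2,2,2" has four fields instead of three
raw = "Val1,Val2,Val3\n0,0,0\n1,1,1\n2,2,2,2\n3,3,3\n"

try:
    pd.read_csv(io.StringIO(raw))
    raised = False
except pd.errors.ParserError as exc:
    raised = True
    print("default behaviour:", exc)

skipped = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(skipped)
```

Skipping silently discards data, so it can be worth counting how many lines were dropped before trusting the result.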


## Dealing with big datasets

Sometimes a dataset is loaded in batches (for instance, to feed a batch machine learning algorithm) in order to:
- take a peek at the data, or
- cope with a dataset that does not fit in the computer's memory when loaded all at once.

### Data streaming
Data streaming consists of loading the dataset in chunks.

#### Using the chunksize parameter of read_csv

In [22]:
import pandas as pd
iris_filename = 'iris/iris.data'
iris_chunks = pd.read_csv(iris_filename, names = ['sepal_length','sepal_width', 
                                             'petal_length', 'petal_width','target'],chunksize = 10)
n = 0 
for chunk in iris_chunks:
    n+=1
    print ('Shape:', chunk.shape)
    print (chunk,n)

Shape: (10, 5)
   sepal_length  sepal_width  petal_length  petal_width       target
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
5           5.4          3.9           1.7          0.4  Iris-setosa
6           4.6          3.4           1.4          0.3  Iris-setosa
7           5.0          3.4           1.5          0.2  Iris-setosa
8           4.4          2.9           1.4          0.2  Iris-setosa
9           4.9          3.1           1.5          0.1  Iris-setosa 1
Shape: (10, 5)
    sepal_length  sepal_width  petal_length  petal_width       target
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa

In [23]:
print(type(iris_chunks))  # this object is not a DataFrame but an iterator like object.

<class 'pandas.io.parsers.readers.TextFileReader'>
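Because each chunk is an ordinary DataFrame, statistics can be accumulated chunk by chunk without ever holding the whole file in memory. A minimal sketch, using a hypothetical 25-row in-memory CSV in place of a large file:

```python
import io
import pandas as pd

# Hypothetical stand-in for a large file: one column, 25 rows
raw = "value\n" + "\n".join(str(i) for i in range(25))

total, n_rows = 0, 0
for chunk in pd.read_csv(io.StringIO(raw), chunksize=10):
    total += chunk["value"].sum()   # accumulate a running sum
    n_rows += len(chunk)            # chunks hold 10, 10 and 5 rows

mean_value = total / n_rows         # mean computed one chunk at a time
print(mean_value)
```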


#### Loading a big dataset by asking pandas for an iterator

In this case, you can
dynamically decide the length (that is, how many lines to get) you want for each piece of
the pandas DataFrame:

In [33]:
import pandas as pd
iris_filename = 'iris/iris.data'
iris_iterator = pd.read_csv(iris_filename, names = ['sepal_length','sepal_width', 
                                             'petal_length', 'petal_width','target'],iterator = True) # first define the iterator

iris_iterator.get_chunk(10)  # fetch a piece of the original DataFrame that contains 10 rows

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


In [25]:
iris_iterator.get_chunk(5) # The output is a piece of the original data

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
10,5.4,3.7,1.5,0.2,Iris-setosa
11,4.8,3.4,1.6,0.2,Iris-setosa
12,4.8,3.0,1.4,0.1,Iris-setosa
13,4.3,3.0,1.1,0.1,Iris-setosa
14,5.8,4.0,1.2,0.2,Iris-setosa


In [26]:
print(iris_iterator.get_chunk(5).shape) 

(5, 5)
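Each call to get_chunk() continues where the previous one stopped, and a request larger than what remains simply returns the remaining rows. A small sketch on a hypothetical 12-row in-memory CSV:

```python
import io
import pandas as pd

raw = "value\n" + "\n".join(str(i) for i in range(12))  # hypothetical 12-row CSV
reader = pd.read_csv(io.StringIO(raw), iterator=True)

first = reader.get_chunk(10)   # rows 0 to 9
rest = reader.get_chunk(10)    # only 2 rows are left, so 2 rows come back

print(first.shape, rest.shape)
```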


#### Using the functions of the csv package

The csv package offers two functions that help to read data from a file in pieces: reader and DictReader.
The reader function returns exactly what it reads, stripped of the carriage return, and split
into a list by the separator (a comma by default, but this can be modified).
DictReader instead maps each row's data into a dictionary, whose keys are defined by the
first row (if a header is present) or by the fieldnames parameter (a list of strings
giving the column names).

#### Use of DictReader

In [27]:
import csv
with open(iris_filename, 'rt') as data_streaming:
    for n, row in enumerate(csv.DictReader(data_streaming, fieldnames = ['sepal_length','sepal_width',
                                                        'petal_length','petal_width','target'],dialect='excel')):
        if n <=5:
            print(n,row)
        else:
            break
        

0 {'sepal_length': '5.1', 'sepal_width': '3.5', 'petal_length': '1.4', 'petal_width': '0.2', 'target': 'Iris-setosa'}
1 {'sepal_length': '4.9', 'sepal_width': '3.0', 'petal_length': '1.4', 'petal_width': '0.2', 'target': 'Iris-setosa'}
2 {'sepal_length': '4.7', 'sepal_width': '3.2', 'petal_length': '1.3', 'petal_width': '0.2', 'target': 'Iris-setosa'}
3 {'sepal_length': '4.6', 'sepal_width': '3.1', 'petal_length': '1.5', 'petal_width': '0.2', 'target': 'Iris-setosa'}
4 {'sepal_length': '5.0', 'sepal_width': '3.6', 'petal_length': '1.4', 'petal_width': '0.2', 'target': 'Iris-setosa'}
5 {'sepal_length': '5.4', 'sepal_width': '3.9', 'petal_length': '1.7', 'petal_width': '0.4', 'target': 'Iris-setosa'}


**NB**: What does the preceding code accomplish? First, it opens the file in text read mode
('rt') and aliases the connection as data_streaming. Using the with statement ensures that the file is closed
after the commands in the indented block have been executed.
Then it iterates (for...in) over an enumerated csv.DictReader call, which wraps the
flow of data from data_streaming. Since the file has no header row,
fieldnames provides the names of the fields. dialect just specifies that we
are reading standard comma-separated CSV (we'll provide some hints on how to modify
this parameter later).
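Note that the csv module does no type conversion: every value arrives as a string, as the quotes in the output above show. A possible way to convert the numeric fields while streaming is sketched below; the two-line sample and the choice to leave only `target` as text are assumptions for illustration:

```python
import csv
import io

# Hypothetical two-line sample in the same layout as iris.data
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
fields = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]

rows = []
for row in csv.DictReader(io.StringIO(raw), fieldnames=fields):
    # convert every field except the label from str to float
    converted = {k: (v if k == "target" else float(v)) for k, v in row.items()}
    rows.append(converted)

print(rows[0])
```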

#### Use of reader

In [28]:
with open(iris_filename,'rt') as data_stream:
    for n,row in enumerate(csv.reader(data_stream, dialect = 'excel')):
        if n<=3:
            print(row)
        else:
            break

['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
['4.9', '3.0', '1.4', '0.2', 'Iris-setosa']
['4.7', '3.2', '1.3', '0.2', 'Iris-setosa']
['4.6', '3.1', '1.5', '0.2', 'Iris-setosa']


#### Function to chunk a dataset using the reader function of the csv package

In [29]:
import csv
import numpy as np

def batch_read(filename, batch=5):
    # open the data stream
    with open(filename, 'rt') as data_stream:
        # start with an empty batch
        batch_output = list()
        # iterate over the file
        for n, row in enumerate(csv.reader(data_stream, dialect='excel')):
            # if the batch is of the right size
            if n > 0 and n % batch == 0:
                # yield back the batch as an ndarray
                yield np.array(batch_output)
                # reset the batch and restart
                batch_output = list()
            # in every case, add the current row to the batch
            batch_output.append(row)
        # when the loop is over, yield what's left
        yield np.array(batch_output)

In [30]:
import numpy as np
for batch_input in batch_read(iris_filename, batch=3):
    print (batch_input)
    break

[['5.1' '3.5' '1.4' '0.2' 'Iris-setosa']
 ['4.9' '3.0' '1.4' '0.2' 'Iris-setosa']
 ['4.7' '3.2' '1.3' '0.2' 'Iris-setosa']]
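As an alternative sketch (not the notebook's function), the same batching can be written with itertools.islice, which removes the modulo bookkeeping. This hypothetical `batch_rows` helper takes an open text stream and yields plain lists instead of ndarrays:

```python
import csv
import io
from itertools import islice

def batch_rows(stream, batch=5):
    """Yield lists of at most `batch` parsed CSV rows from a text stream."""
    reader = csv.reader(stream)
    while True:
        rows = list(islice(reader, batch))  # take up to `batch` rows
        if not rows:                        # reader exhausted
            return
        yield rows

raw = "\n".join(f"{i},{i * 2}" for i in range(7))   # hypothetical 7-row CSV
sizes = [len(b) for b in batch_rows(io.StringIO(raw), batch=3)]
print(sizes)
```

The last batch is shorter whenever the number of rows is not a multiple of the batch size, so downstream code should not assume a fixed length.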


## Loading data from other file formats

Pandas can also load MS Excel, HDF5, SQL, JSON, HTML, and Stata datasets.

Pandas readers that generally return a DataFrame object:

- read_csv
- read_excel
- read_hdf
- read_sql
- read_json
- read_msgpack (experimental; removed in pandas 1.0)
- read_html
- read_gbq (experimental)
- read_stata
- read_clipboard
- read_pickle
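As an illustration of one of these readers, the sketch below loads a SQL query result into a DataFrame. It assumes Python's built-in sqlite3 module and a throwaway in-memory database; the table and column names are hypothetical:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")           # throwaway in-memory database
conn.execute("CREATE TABLE temps (city TEXT, temperature REAL)")
conn.executemany("INSERT INTO temps VALUES (?, ?)",
                 [("city_1", 80.0), ("city_2", 32.0)])
conn.commit()

# read_sql_query runs the query and returns the result as a DataFrame
temps_df = pd.read_sql_query("SELECT * FROM temps ORDER BY city", conn)
conn.close()
print(temps_df)
```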

#### Documentation on how to access data format
http://pandas.pydata.org/pandas-docs/version/0.16/io.html.
#### SQLite database
https://www.sqlite.org/index.html
#### Speed up saving and loading data with HDF5
https://support.hdfgroup.org/HDF5/whatishdf5.html
#### How to use h5py package
http://docs.h5py.org/en/stable/

#### Putting data together: the concat function of pandas

In [31]:
import pandas as pd

df1 = pd.DataFrame({'Col1': range(5),
'Col2': [1.0]*5,
'Col3': 1.0,
'Col4': 'Hello World!'})
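# Note: `names` labels index levels, not columns; with ignore_index=True the new columns are numbered 0 and 1
df2 = pd.concat([pd.Series([2,6,8,10,12]),pd.Series([3,5,7,9,14])], ignore_index = True, names = ['Col5','Col6'],axis=1 )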
df3 = pd.concat([df1,df2],axis = 1)
display(df2)
display(df3)

Unnamed: 0,0,1
0,2,3
1,6,5
2,8,7
3,10,9
4,12,14


Unnamed: 0,Col1,Col2,Col3,Col4,0,1
0,0,1.0,1.0,Hello World!,2,3
1,1,1.0,1.0,Hello World!,6,5
2,2,1.0,1.0,Hello World!,8,7
3,3,1.0,1.0,Hello World!,10,9
4,4,1.0,1.0,Hello World!,12,14
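In the output above the two concatenated Series end up in columns labelled 0 and 1. If the intent is to give them names such as Col5 and Col6, one option is the keys argument of concat (without ignore_index), sketched here on the same values:

```python
import pandas as pd

s1 = pd.Series([2, 6, 8, 10, 12])
s2 = pd.Series([3, 5, 7, 9, 14])

# keys= supplies the column labels when concatenating along axis=1
named = pd.concat([s1, s2], axis=1, keys=["Col5", "Col6"])
print(named.columns.tolist())
```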


### Check the type of the elements in each column

We can use the dtypes attribute to find out the type of the elements in each column.

In [34]:
fake_dataset.dtypes

Date                  datetime64[ns]
Temperature_city_1           float64
Temperature_city_2           float64
Temperature_city_3             int64
Which_destination              int64
dtype: object

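You can also obtain information about your DataFrame structure and data
types using the info() method. (The output below comes from a later cell, In [38], run after the
astype conversion shown next, which is why Temperature_city_3 already appears as float64.)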

In [38]:
fake_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Date                6 non-null      datetime64[ns]
 1   Temperature_city_1  5 non-null      float64       
 2   Temperature_city_2  5 non-null      float64       
 3   Temperature_city_3  6 non-null      float64       
 4   Which_destination   6 non-null      int64         
dtypes: datetime64[ns](1), float64(3), int64(1)
memory usage: 372.0 bytes


We can use the astype method to convert a column from one numeric type to another, for example from integer to float:

In [36]:
fake_dataset['Temperature_city_3'] = fake_dataset['Temperature_city_3'].astype(float)

In [37]:
fake_dataset.dtypes


Date                  datetime64[ns]
Temperature_city_1           float64
Temperature_city_2           float64
Temperature_city_3           float64
Which_destination              int64
dtype: object
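astype also accepts a dictionary, which converts several columns in one call. A short sketch on a hypothetical frame (the column names and the choice of the category dtype are illustrative assumptions):

```python
import pandas as pd

demo_df = pd.DataFrame({"temp": [40, 36, 46], "dest": [1, 2, 1]})

# Map each column name to its target dtype; unlisted columns are left alone
converted = demo_df.astype({"temp": "float64", "dest": "category"})
print(converted.dtypes)
```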

## Data preprocessing
#### Creating a mask
A mask is a pandas Series of boolean values telling you whether each row is selected or not.

In [39]:
mask_feature = iris_df['sepal_length']>6
mask_feature

0      False
1      False
2      False
3      False
4      False
       ...  
145     True
146     True
147     True
148     True
149    False
Name: sepal_length, Length: 150, dtype: bool
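A mask becomes useful once it is passed back to the DataFrame to filter rows, and several masks can be combined with `&` (and) or `|` (or). A minimal sketch on a hypothetical four-row frame:

```python
import pandas as pd

demo_df = pd.DataFrame({
    "sepal_length": [5.1, 6.7, 6.3, 5.9],
    "target": ["Iris-setosa", "Iris-virginica", "Iris-virginica", "Iris-virginica"],
})

long_sepal = demo_df["sepal_length"] > 6
virginica = demo_df["target"] == "Iris-virginica"

selected = demo_df[long_sepal & virginica]   # rows where both conditions hold
n_long = int(long_sepal.sum())               # True counts as 1, so sum() counts matches
print(selected)
print(n_long)
```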

#### Substitute a target label with another label

In [43]:
mask_target = iris_df['target'] == 'Iris-setosa'
iris_df.loc[mask_target, 'target']= "new_label"
iris_df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,new_label
1,4.9,3.0,1.4,0.2,new_label
2,4.7,3.2,1.3,0.2,new_label
3,4.6,3.1,1.5,0.2,new_label
4,5.0,3.6,1.4,0.2,new_label
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


#### List the distinct labels

In [44]:
iris_df['target'].unique()

array(['new_label', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

#### Compute statistics with the groupby() method

*Computing the average of each column grouped by targets*

In [46]:
grouped_target_mean = iris_df.groupby(['target']).mean()
grouped_target_mean

Unnamed: 0_level_0,sepal_length,sepal_width,petal_length,petal_width
target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Iris-versicolor,5.936,2.77,4.26,1.326
Iris-virginica,6.588,2.974,5.552,2.026
new_label,5.006,3.418,1.464,0.244


*Computing the variance of each column grouped by targets*

In [47]:
grouped_target_var = iris_df.groupby(['target']).var()
grouped_target_var

Unnamed: 0_level_0,sepal_length,sepal_width,petal_length,petal_width
target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Iris-versicolor,0.266433,0.098469,0.220816,0.039106
Iris-virginica,0.404343,0.104004,0.304588,0.075433
new_label,0.124249,0.14518,0.030106,0.011494


#### Compute multiple statistics at once with the agg method

In [49]:
funcs = {'sepal_length':['min','max','std','mean'],'sepal_width':['min','max','std','mean'],
         "petal_length":['min','max','std','mean'],'petal_width':['min','max','std','mean']}

grouped_target_aggregate = iris_df.groupby(['target']).agg(funcs)
grouped_target_aggregate

Unnamed: 0_level_0,sepal_length,sepal_length,sepal_length,sepal_length,sepal_width,sepal_width,sepal_width,sepal_width,petal_length,petal_length,petal_length,petal_length,petal_width,petal_width,petal_width,petal_width
Unnamed: 0_level_1,min,max,std,mean,min,max,std,mean,min,max,std,mean,min,max,std,mean
target,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
Iris-versicolor,4.9,7.0,0.516171,5.936,2.0,3.4,0.313798,2.77,3.0,5.1,0.469911,4.26,1.0,1.8,0.197753,1.326
Iris-virginica,4.9,7.9,0.63588,6.588,2.2,3.8,0.322497,2.974,4.5,6.9,0.551895,5.552,1.4,2.5,0.27465,2.026
new_label,4.3,5.8,0.35249,5.006,2.3,4.4,0.381024,3.418,1.0,1.9,0.173511,1.464,0.1,0.6,0.10721,0.244
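The dictionary form above produces two-level column headers. When flat, self-describing column names are preferred, named aggregation is one alternative; the sketch below assumes pandas 0.25 or later and a hypothetical four-row frame:

```python
import pandas as pd

demo_df = pd.DataFrame({
    "target": ["a", "a", "b", "b"],
    "sepal_length": [5.0, 5.2, 6.0, 7.0],
})

# Each keyword becomes an output column: name=(input column, function)
summary = demo_df.groupby("target").agg(
    sl_min=("sepal_length", "min"),
    sl_max=("sepal_length", "max"),
    sl_mean=("sepal_length", "mean"),
)
print(summary)
```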


#### Sorting the DataFrame

In [54]:
iris_df.sort_index().head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,new_label
1,4.9,3.0,1.4,0.2,new_label
2,4.7,3.2,1.3,0.2,new_label
3,4.6,3.1,1.5,0.2,new_label
4,5.0,3.6,1.4,0.2,new_label
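sort_index() orders rows by their index labels; to order them by the values of one or more columns, sort_values() can be used instead. A small sketch on a hypothetical frame:

```python
import pandas as pd

demo_df = pd.DataFrame({
    "sepal_length": [5.1, 6.7, 4.9],
    "target": ["b", "a", "b"],
})

# Largest sepal_length first
by_length = demo_df.sort_values("sepal_length", ascending=False)
# Sort by target, then by sepal_length within each target
by_target_then_length = demo_df.sort_values(["target", "sepal_length"])

print(by_length.index.tolist())
print(by_target_then_length.index.tolist())
```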


#### Use of the apply() method to operate row-wise or column-wise

More generally, the pandas apply() method can perform any row-wise or column-wise
operation programmatically.

*Counting the number of non-zero elements in each row*

In [57]:
apply1 = iris_df.apply(np.count_nonzero, axis=1).head()
apply1

0    5
1    5
2    5
3    5
4    5
dtype: int64

*Counting the number of non-zero elements in each column*

In [58]:
apply2 = iris_df.apply(np.count_nonzero, axis = 0)
apply2

sepal_length    150
sepal_width     150
petal_length    150
petal_width     150
target          150
dtype: int64
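The axis argument decides what the function receives: with axis=0 it is called once per column, with axis=1 once per row. A minimal sketch with two hypothetical lambdas on a small frame:

```python
import pandas as pd

demo_df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: the function sees one column at a time
col_range = demo_df.apply(lambda col: col.max() - col.min(), axis=0)
# axis=1: the function sees one row at a time
row_sum = demo_df.apply(lambda row: row["a"] + row["b"], axis=1)

print(col_range.to_dict(), row_sum.tolist())
```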

*Applying a user-defined function to the numeric columns*

In [80]:
def square(x):
    return x**2
# keep every column except the last one ('target'), which is not numeric
iris_column_names = iris_df.columns
iris_column_names = iris_column_names.delete(len(iris_column_names)-1)
square_iris = iris_df[iris_column_names].apply(square)
square_iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,26.01,12.25,1.96,0.04
1,24.01,9.00,1.96,0.04
2,22.09,10.24,1.69,0.04
3,21.16,9.61,2.25,0.04
4,25.00,12.96,1.96,0.04
...,...,...,...,...
145,44.89,9.00,27.04,5.29
146,39.69,6.25,25.00,3.61
147,42.25,9.00,27.04,4.00
148,38.44,11.56,29.16,5.29


#### Use of the applymap() method to operate element-wise
*Length of the string representation of each cell*

In [60]:
# applymap() is deprecated since pandas 2.1 in favour of DataFrame.map()
applymapp = iris_df.applymap(lambda x: len(str(x))).head()
applymapp

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
0,3,3,3,3,9
1,3,3,3,3,9
2,3,3,3,3,9
3,3,3,3,3,9
4,3,3,3,3,9
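Recent pandas releases (2.1 and later) provide the same element-wise operation as DataFrame.map. The sketch below is one possible way to write code that runs on both old and new versions; the frame and the fallback logic are illustrative assumptions:

```python
import pandas as pd

demo_df = pd.DataFrame({"x": [1.5, 22.25], "label": ["ab", "abc"]})

# Use DataFrame.map where it exists, otherwise fall back to applymap
elementwise = getattr(demo_df, "map", None) or demo_df.applymap
lengths = elementwise(lambda value: len(str(value)))
print(lengths)
```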


## Data Selection 

#### Importing the dataset and indexing the DataFrame

In [81]:
import pandas as pd 
df = pd.read_csv('/home/fabrice/Desktop/Data_sciences_projects/new_dataset.csv',index_col = 0)
df

Unnamed: 0_level_0,val1,val2,val3
n,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
100,10,10,C
101,10,20,C
102,10,30,B
103,10,40,B
104,10,50,A


#### Select an element in a DataFrame

In [82]:
df['val3'][103] # select with []

'B'

In [83]:
df.loc[103, "val3"] # select with loc

'B'

In [84]:
df.iloc[3,2]  # select with iloc

'B'

#### Retrieval of sub-matrices

In [85]:
df[['val2','val3']][0:2]

Unnamed: 0_level_0,val2,val3
n,Unnamed: 1_level_1,Unnamed: 2_level_1
100,10,C
101,20,C


In [89]:
df.loc[range(100,102), ['val2','val3']]

Unnamed: 0_level_0,val2,val3
n,Unnamed: 1_level_1,Unnamed: 2_level_1
100,10,C
101,20,C


In [91]:
df.iloc[range(2),[1,2]]

Unnamed: 0_level_0,val2,val3
n,Unnamed: 1_level_1,Unnamed: 2_level_1
100,10,C
101,20,C
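One difference between the two accessors is easy to miss: label slices with loc include both endpoints, whereas position slices with iloc exclude the stop, as ordinary Python slices do. A sketch on a hypothetical frame shaped like the one above:

```python
import pandas as pd

demo_df = pd.DataFrame(
    {"val1": [10, 10, 10], "val2": [10, 20, 30], "val3": ["C", "C", "B"]},
    index=[100, 101, 102],
)

by_label = demo_df.loc[100:101, "val2":"val3"]   # label slices include both ends
by_position = demo_df.iloc[0:2, 1:3]             # position slices exclude the stop

same_selection = by_label.equals(by_position)
print(same_selection)
```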


## Dealing with textual data
There are two main types of data: numerical and categorical.

- Numerical data: data that typically have a float type and for which comparison operators (less than, greater than, equal to) are defined.
- Categorical data: data that take values from a finite or infinite set and for which ordering operators (less than, greater than) are not defined.

In addition, there are booleans (which can be classified as categorical data); they indicate whether a feature is present or not.

Boolean features are often used to encode categorical features as numerical values.
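One common form of this encoding creates one indicator column per category (often called one-hot encoding). A minimal sketch with pandas get_dummies, on a hypothetical `weather` feature:

```python
import pandas as pd

labels = pd.Series(["sunny", "rainy", "sunny", "cloudy"], name="weather")

# One indicator column per category; dtype=int gives 0/1 values
encoded = pd.get_dummies(labels, dtype=int)
print(encoded)
```

Each row contains exactly one 1, in the column of its original category.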