Before we start, we import some useful libraries. 

In [3]:
import pandas as pd

# I - DATA PREPROCESSING

## 1 - Loading data

In [7]:
cereals = pd.read_csv('cereals.csv')
cereals.head()

Unnamed: 0,Name,Manuf,Type,Calories,Protein,Fat,Sodium,Fiber,Carbo,Sugars,...,Weight,Cups,Rating,Cold,Nabisco,Quaker,Kelloggs,GeneralMills,Ralston,AHFP
0,100%_Bran,N,C,70,4,1,130,10.0,5.0,6.0,...,1.0,0.33,68.402973,1,1,0,0,0,0,0
1,100%_Natural_Bran,Q,C,120,3,5,15,2.0,8.0,8.0,...,1.0,1.0,33.983679,1,0,1,0,0,0,0
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5.0,...,1.0,0.33,59.425505,1,0,0,1,0,0,0
3,All-Bran_with_Extra_Fiber,K,C,50,4,0,140,14.0,8.0,0.0,...,1.0,0.5,93.704912,1,0,0,1,0,0,0
4,Almond_Delight,R,C,110,2,2,200,1.0,14.0,8.0,...,1.0,0.75,34.384843,1,0,0,0,0,1,0


In [14]:
var = cereals.columns.tolist()
print(var)

['Name', 'Manuf', 'Type', 'Calories', 'Protein', 'Fat', 'Sodium', 'Fiber', 'Carbo', 'Sugars', 'Potass', 'Vitamins', 'Shelf', 'Weight', 'Cups', 'Rating', 'Cold', 'Nabisco', 'Quaker', 'Kelloggs', 'GeneralMills', 'Ralston', 'AHFP']


In [17]:
print(cereals.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 23 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Name          77 non-null     object 
 1   Manuf         77 non-null     object 
 2   Type          77 non-null     object 
 3   Calories      77 non-null     int64  
 4   Protein       77 non-null     int64  
 5   Fat           77 non-null     int64  
 6   Sodium        77 non-null     int64  
 7   Fiber         77 non-null     float64
 8   Carbo         76 non-null     float64
 9   Sugars        76 non-null     float64
 10  Potass        75 non-null     float64
 11  Vitamins      77 non-null     int64  
 12  Shelf         77 non-null     int64  
 13  Weight        77 non-null     float64
 14  Cups          77 non-null     float64
 15  Rating        77 non-null     float64
 16  Cold          77 non-null     int64  
 17  Nabisco       77 non-null     int64  
 18  Quaker        77 non-null     in

The cereals data set clearly contains missing values: Carbo and Sugars each have 76 non-null entries and Potass has 75, out of 77 rows. I will verify and handle this in the sections below.
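A quick way to check this is to count the missing entries per column. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the real cereals data) so that it runs on its own; on the real data the equivalent call would presumably be `cereals.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cereals table (values are made up for illustration)
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Carbo": [5.0, np.nan, 7.0, 8.0],
    "Potass": [280.0, np.nan, np.nan, 330.0],
})

# isna() marks missing cells with True; sum() then counts them column by column
missing_per_column = toy.isna().sum()

# Keep only the columns that actually contain missing values
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)
```

Unlike reading the `info()` listing by eye, this gives the number of missing cells directly and scales to wide tables.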

## 2 - Creating dependent and independent variable vectors

Note that most of the data cleaning and transformation concerns the independent variables, so it is important to first split the dataset into dependent and independent variables.

In [47]:
X = cereals.iloc[:,:-1] # the independent variables (every column except the last)
Y = cereals.iloc[:,-1]  # the dependent variable (the last column)
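To make the positional split concrete, here is a minimal sketch on a made-up frame. The column names and the `Target` column are assumptions for illustration only; the split by name is shown as an alternative that does not depend on column order.

```python
import pandas as pd

# Hypothetical frame; "Target" plays the role of the dependent variable
toy = pd.DataFrame({
    "Calories": [70, 120, 50],
    "Sugars": [6.0, 8.0, 0.0],
    "Target": [1, 0, 1],
})

# Positional split: every column except the last, then the last one
X_toy = toy.iloc[:, :-1].copy()   # .copy() so later edits do not touch `toy`
y_toy = toy.iloc[:, -1]

# Equivalent split by name, which keeps working if the column order changes
X_by_name = toy.drop(columns="Target")
y_by_name = toy["Target"]

print(X_toy.columns.tolist(), y_toy.name)
```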

## 3 - Dealing with missing values

Here, I will write some simple code that checks for missing values and fills them in.
* For categorical variables, missing values should be replaced with the mode. For this, I am going to define a function that returns the mode of a categorical feature.

In [30]:
def mode(feature):
    seen = dict()  # seen[value] = number of occurrences of this value in feature
    for value in feature:
        # the dictionary lets us test for a known value and update its count in one step
        seen[value] = seen.get(value, 0) + 1
    # max() returns the first key with the highest count, so ties go to the value seen first
    return max(seen, key=seen.get)
    
print(mode(['a','b','c','a','e','f','a'])) 

a
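For comparison, the same result can be obtained with the standard library. This is only a sketch of an alternative, and `mode_counter` is a hypothetical name, not something defined elsewhere in this notebook.

```python
from collections import Counter

def mode_counter(feature):
    """Return the most frequent value in `feature` (the first one seen wins ties)."""
    counts = Counter(feature)              # value -> number of occurrences, in one pass
    value, _ = counts.most_common(1)[0]    # the (value, count) pair with the highest count
    return value

print(mode_counter(['a', 'b', 'c', 'a', 'e', 'f', 'a']))
```

`Counter` does the counting in a single pass over the data, which matters little for 77 rows but avoids the nested loop on larger features.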


* For numeric features, I will simply replace the missing values with the mean of the feature. I am going to define a function that returns the mean of a numeric feature.

In [29]:
def mean(feature) : 
    return sum(feature) / len(feature)
print(mean([1,2,4,2,45,2,9,0,4,2,7,10,2]))    

6.923076923076923
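One caveat when using such a helper for imputation: if the feature itself contains a missing value (a float NaN), the sum becomes NaN and so does the mean. A minimal sketch of a variant that only averages the observed entries follows; `mean_observed` is a hypothetical name introduced here for illustration.

```python
import math

def mean_observed(feature):
    """Mean of the non-missing entries of a numeric feature."""
    observed = [v for v in feature if not math.isnan(v)]
    return sum(observed) / len(observed)

values = [5.0, float('nan'), 7.0, 9.0]
print(sum(values) / len(values))   # nan: the missing entry contaminates the sum
print(mean_observed(values))       # 7.0
```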


At this stage, we are able to write the main code of this section that is going to deal with the missing values :)

In [39]:
print(cereals.Name.dtype == 'object')

True
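Before looping over the columns, it helps to be clear about how a missing value can be detected. pandas stores missing numbers as the float NaN, which is equal neither to the string `'NaN'` nor to itself, so an equality test does not find it. The sketch below uses a made-up column to illustrate this, together with a dtype check that does not rely on comparing against `'object'`.

```python
import numpy as np
import pandas as pd

column = pd.Series([5.0, np.nan, 7.0])   # made-up numeric column with one missing entry
value = float(column.iloc[1])            # the missing entry, as a plain Python float

matches_string = (value == 'NaN')   # False: the entry is a float, not a string
matches_itself = (value == value)   # False: NaN is not even equal to itself
detected = pd.isna(value)           # True: the reliable test
print(matches_string, matches_itself, detected)

# Column-level dtype checks
numeric_is_numeric = pd.api.types.is_numeric_dtype(column)
text_is_numeric = pd.api.types.is_numeric_dtype(pd.Series(['N', 'Q']))
print(numeric_is_numeric, text_is_numeric)
```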


In [55]:
X2 = pd.DataFrame(X, copy=True)

In [50]:
features_name = X.columns

for f_name in features_name:
    f_values = X[f_name]
    missing = f_values.isna()    # True where the entry is missing (NaN/None)
    if not missing.any():
        continue                 # nothing to fill in this column
    observed = f_values[~missing].tolist()    # the values that are actually present
    if f_values.dtype == 'object':
        fill_value = mode(observed)    # categorical feature: most frequent value
    else:
        fill_value = mean(observed)    # numeric feature: mean of the observed values
    X[f_name] = f_values.fillna(fill_value)

In [51]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 22 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Name          77 non-null     object 
 1   Manuf         77 non-null     object 
 2   Type          77 non-null     object 
 3   Calories      77 non-null     int64  
 4   Protein       77 non-null     int64  
 5   Fat           77 non-null     int64  
 6   Sodium        77 non-null     int64  
 7   Fiber         77 non-null     float64
 8   Carbo         77 non-null     float64
 9   Sugars        77 non-null     float64
 10  Potass        77 non-null     float64
 11  Vitamins      77 non-null     int64  
 12  Shelf         77 non-null     int64  
 13  Weight        77 non-null     float64
 14  Cups          77 non-null     float64
 15  Rating        77 non-null     float64
 16  Cold          77 non-null     int64  
 17  Nabisco       77 non-null     int64  
 18  Quaker        77 non-null     in

In [58]:
def new_column(column):
    # Return the column with its missing entries replaced by the mean of the observed values
    missing = column.isna()
    if not missing.any():
        return column
    observed = column[~missing].tolist()
    return column.fillna(mean(observed))


In [61]:
T = cereals['Carbo']
X2['Carbo'] = new_column(T)
X2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 22 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Name          77 non-null     object 
 1   Manuf         77 non-null     object 
 2   Type          77 non-null     object 
 3   Calories      77 non-null     int64  
 4   Protein       77 non-null     int64  
 5   Fat           77 non-null     int64  
 6   Sodium        77 non-null     int64  
 7   Fiber         77 non-null     float64
 8   Carbo         77 non-null     float64
 9   Sugars        76 non-null     float64
 10  Potass        75 non-null     float64
 11  Vitamins      77 non-null     int64  
 12  Shelf         77 non-null     int64  
 13  Weight        77 non-null     float64
 14  Cups          77 non-null     float64
 15  Rating        77 non-null     float64
 16  Cold          77 non-null     int64  
 17  Nabisco       77 non-null     int64  
 18  Quaker        77 non-null     in
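
The whole imputation step can also be written with pandas' built-in methods. The following is a minimal sketch on a made-up frame, not the cereals data: `impute` and `toy` are hypothetical names, and the rule (mean for numeric columns, mode otherwise) is the one described in this section.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing categorical and one missing numeric entry
toy = pd.DataFrame({
    "Manuf": ["N", "Q", None, "N"],
    "Carbo": [5.0, np.nan, 7.0, 9.0],
})

def impute(frame):
    """Return a copy of `frame` with missing values filled in column by column."""
    filled = frame.copy()
    for name in filled.columns:
        col = filled[name]
        if not col.isna().any():
            continue                          # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(col):
            fill_value = col.mean()           # Series.mean() skips NaN by default
        else:
            fill_value = col.mode().iloc[0]   # most frequent observed value
        filled[name] = col.fillna(fill_value)
    return filled

clean = impute(toy)
print(clean)
```

Working on a copy leaves the original frame untouched, so the result can be compared with the raw data, for example with `clean.isna().sum()`.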