# Check for basic stats and clean the dataset
by Smahi

## Scope
- Check for null values in every column.
- Check for basic stats for the numeric columns.
- Drop the unnamed column if it is redundant.
- Save the processed data to a new CSV file.

## Summary
- Renamed the columns that contained spaces to easier-to-use, underscore-separated names.
- Dropped the `Unnamed: 12` column because it was an exact replica of the `Output` column.

## Imports

In [1]:
# Import libraries
import pandas as pd

In [2]:
# Read csv
df = pd.read_csv("C:/Users/SMAHI/Desktop/Online-food-delivery/Data/onlinefood.csv")

In [3]:
# Preview
df.head()

,Age,Gender,Marital Status,Occupation,Monthly Income,Educational Qualifications,Family size,latitude,longitude,Pin code,Output,Feedback,Unnamed: 12
0,20,Female,Single,Student,No Income,Post Graduate,4,12.9766,77.5993,560001,Yes,Positive,Yes
1,24,Female,Single,Student,Below Rs.10000,Graduate,3,12.977,77.5773,560009,Yes,Positive,Yes
2,22,Male,Single,Student,Below Rs.10000,Post Graduate,3,12.9551,77.6593,560017,Yes,Negative,Yes
3,22,Female,Single,Student,No Income,Graduate,6,12.9473,77.5616,560019,Yes,Positive,Yes
4,22,Male,Single,Student,Below Rs.10000,Post Graduate,4,12.985,77.5533,560010,Yes,Positive,Yes


In [4]:
# Shape
df.shape

(388, 13)

In [5]:
# Check for col names
df.columns

Index(['Age', 'Gender', 'Marital Status', 'Occupation', 'Monthly Income',
       'Educational Qualifications', 'Family size', 'latitude', 'longitude',
       'Pin code', 'Output', 'Feedback', 'Unnamed: 12'],
      dtype='object')

In [6]:
# Check for null values
df.isnull().sum()

Age                           0
Gender                        0
Marital Status                0
Occupation                    0
Monthly Income                0
Educational Qualifications    0
Family size                   0
latitude                      0
longitude                     0
Pin code                      0
Output                        0
Feedback                      0
Unnamed: 12                   0
dtype: int64

There are no null values in any column.
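When a dataset does contain gaps, the same check can be extended to show the share of missing rows per column. The sketch below is only an illustration: `toy` and `null_report` are made-up names, and the frame deliberately includes one missing value, unlike the real data.

```python
import pandas as pd

# Made-up frame with one missing Age, purely for illustration
toy = pd.DataFrame({
    "Age": [20, None, 22],
    "Gender": ["Female", "Male", "Male"],
})

# Count of nulls and percentage of rows that are null, per column
null_report = pd.DataFrame({
    "null_count": toy.isnull().sum(),
    "null_pct": toy.isnull().mean() * 100,
})
print(null_report)
```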

In [7]:
# Check the data type for all columns
df.dtypes

Age                             int64
Gender                         object
Marital Status                 object
Occupation                     object
Monthly Income                 object
Educational Qualifications     object
Family size                     int64
latitude                      float64
longitude                     float64
Pin code                        int64
Output                         object
Feedback                       object
Unnamed: 12                    object
dtype: object

In [8]:
# Use describe for basic stats
df.describe()

,Age,Family size,latitude,longitude,Pin code
count,388.0,388.0,388.0,388.0,388.0
mean,24.628866,3.280928,12.972058,77.60016,560040.113402
std,2.975593,1.351025,0.044489,0.051354,31.399609
min,18.0,1.0,12.8652,77.4842,560001.0
25%,23.0,2.0,12.9369,77.565275,560010.75
50%,24.0,3.0,12.977,77.5921,560033.5
75%,26.0,4.0,12.997025,77.6309,560068.0
max,33.0,6.0,13.102,77.7582,560109.0
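By default `describe()` summarises only the numeric columns. A rough way to get a comparable summary for the text columns is to exclude the numeric dtypes; the small `toy` frame below is a made-up stand-in for the real data, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a few of the dataset's columns
toy = pd.DataFrame({
    "Age": [20, 24, 22],
    "Gender": ["Female", "Female", "Male"],
    "Output": ["Yes", "Yes", "No"],
})

# Excluding numeric dtypes yields count, unique, top and freq per text column
text_summary = toy.describe(exclude=[np.number])
print(text_summary)
```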


- Rename the columns that contain spaces to underscore-separated names.

In [9]:
# Rename the columns that contain spaces to underscore-separated names
df.rename(columns={'Marital Status': 'Marital_status',
                   'Monthly Income': 'Monthly_income',
                   'Educational Qualifications': 'Education',
                   'Family size': 'Family_size',
                   'Pin code': 'Pin_code'}, inplace=True)
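The explicit mapping above keeps the first letter capitalised and shortens one label by hand. If fully lowercase snake_case is preferred, the renaming could instead be done with a small helper. This is a sketch under that assumption: `to_snake_case` and `sample` are hypothetical names, and the result differs from the notebook's mapping (for example, it would not shorten `Educational Qualifications` to `Education`).

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Lowercase a label and replace runs of non-alphanumeric characters with '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()

# Empty stand-in frame that reuses a few of the original column labels
sample = pd.DataFrame(columns=["Marital Status", "Family size", "Pin code", "latitude"])

# rename accepts a function and applies it to every column label
sample = sample.rename(columns=to_snake_case)
print(list(sample.columns))
```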

In [10]:
# Preview
df.tail()

,Age,Gender,Marital_status,Occupation,Monthly_income,Education,Family_size,latitude,longitude,Pin_code,Output,Feedback,Unnamed: 12
383,23,Female,Single,Student,No Income,Post Graduate,2,12.9766,77.5993,560001,Yes,Positive,Yes
384,23,Female,Single,Student,No Income,Post Graduate,4,12.9854,77.7081,560048,Yes,Positive,Yes
385,22,Female,Single,Student,No Income,Post Graduate,5,12.985,77.5533,560010,Yes,Positive,Yes
386,23,Male,Single,Student,Below Rs.10000,Post Graduate,2,12.977,77.5773,560009,Yes,Positive,Yes
387,23,Male,Single,Student,No Income,Post Graduate,5,12.8988,77.5764,560078,Yes,Positive,Yes


In [11]:
# Count the values in 'Unnamed: 12'
df['Unnamed: 12'].value_counts()

Yes    301
No      87
Name: Unnamed: 12, dtype: int64

In [12]:
# Count the values in 'Output' for comparison
df['Output'].value_counts()

Yes    301
No      87
Name: Output, dtype: int64

In [13]:
# Check if 'Unnamed: 12' is a row-by-row replica of 'Output'
is_replica = (df['Unnamed: 12'] == df['Output']).all()

In [14]:
is_replica

True
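Matching value counts alone do not prove two columns are identical, which is why the element-wise comparison matters. The same idea can be generalised to scan a whole frame for replicated columns. The following is a minimal sketch: `find_duplicate_columns` and `toy` are hypothetical names, and the data is made up.

```python
import pandas as pd

def find_duplicate_columns(frame):
    """Return (first, second) label pairs for columns holding identical values."""
    pairs = []
    cols = list(frame.columns)
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:
            # Series.equals compares the values row by row, ignoring the column name
            if frame[first].equals(frame[second]):
                pairs.append((first, second))
    return pairs

# Made-up frame in which 'Unnamed: 12' repeats 'Output'
toy = pd.DataFrame({
    "Output": ["Yes", "No", "Yes"],
    "Feedback": ["Positive", "Negative", "Positive"],
    "Unnamed: 12": ["Yes", "No", "Yes"],
})
print(find_duplicate_columns(toy))
```

The pairwise scan is quadratic in the number of columns, which is fine for a frame of this width.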

In [15]:
# Drop the redundant 'Unnamed: 12' column
df.drop(columns='Unnamed: 12', inplace=True)

In [16]:
# Preview
df.head()

,Age,Gender,Marital_status,Occupation,Monthly_income,Education,Family_size,latitude,longitude,Pin_code,Output,Feedback
0,20,Female,Single,Student,No Income,Post Graduate,4,12.9766,77.5993,560001,Yes,Positive
1,24,Female,Single,Student,Below Rs.10000,Graduate,3,12.977,77.5773,560009,Yes,Positive
2,22,Male,Single,Student,Below Rs.10000,Post Graduate,3,12.9551,77.6593,560017,Yes,Negative
3,22,Female,Single,Student,No Income,Graduate,6,12.9473,77.5616,560019,Yes,Positive
4,22,Male,Single,Student,Below Rs.10000,Post Graduate,4,12.985,77.5533,560010,Yes,Positive


- The dataset contains several categorical columns, so print the categories of each along with their counts.

In [17]:
# Print categories with count
df.Gender.value_counts()

Male      222
Female    166
Name: Gender, dtype: int64

In [18]:
# Print categories with count
df.Marital_status.value_counts()

Single               268
Married              108
Prefer not to say     12
Name: Marital_status, dtype: int64

In [19]:
# Print categories with count
df.Occupation.value_counts()

Student           207
Employee          118
Self Employeed     54
House wife          9
Name: Occupation, dtype: int64

In [20]:
# Print categories with count
df.Monthly_income.value_counts()

No Income          187
25001 to 50000      69
More than 50000     62
10001 to 25000      45
Below Rs.10000      25
Name: Monthly_income, dtype: int64

In [21]:
# Print categories with count
df.Education.value_counts()

Graduate         177
Post Graduate    174
Ph.D              23
School            12
Uneducated         2
Name: Education, dtype: int64

In [22]:
# Print categories with count
df.Family_size.value_counts()

3    117
2    101
4     63
5     54
6     29
1     24
Name: Family_size, dtype: int64

In [23]:
# Print categories with count
df.Feedback.value_counts()

Positive     317
Negative      71
Name: Feedback, dtype: int64

In [24]:
# Print categories with count
df.Output.value_counts()

Yes    301
No      87
Name: Output, dtype: int64
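The per-column cells above could also be condensed into a single loop. The sketch below assumes every non-numeric column should be treated as categorical; `toy`, `categorical_cols` and `counts` are illustrative names, and the frame is a made-up stand-in for the cleaned data.

```python
import pandas as pd

# Made-up stand-in for the cleaned dataset
toy = pd.DataFrame({
    "Age": [20, 24, 22],
    "Gender": ["Female", "Female", "Male"],
    "Feedback": ["Positive", "Positive", "Negative"],
})

# Treat every non-numeric column as categorical
categorical_cols = toy.select_dtypes(exclude="number").columns

# Collect the category counts per column, then print them one by one
counts = {col: toy[col].value_counts() for col in categorical_cols}
for col, series in counts.items():
    print(f"--- {col} ---")
    print(series.to_string())
```

A numeric column with few distinct values, such as `Family_size`, would not be picked up by this selection and would need to be added by name.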

In [25]:
# Save the DataFrame to a new CSV file
new_csv_path = r'C:/Users/SMAHI/Desktop/Online-food-delivery/Data/clean_data.csv'
df.to_csv(new_csv_path, index=False)

In [26]:
# Confirm that the DataFrame has been saved
print(f"DataFrame has been saved to {new_csv_path}")

DataFrame has been saved to C:/Users/SMAHI/Desktop/Online-food-delivery/Data/clean_data.csv
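The print statement only echoes the path; it does not confirm what was written. One possible check is to read the file back and compare its shape and column labels. This sketch uses a temporary directory and a made-up `toy` frame rather than the notebook's actual path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame standing in for the cleaned dataset
toy = pd.DataFrame({"Age": [20, 24], "Output": ["Yes", "No"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "clean_data.csv"
    toy.to_csv(out_path, index=False)
    # Read the file back while the temporary directory still exists
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```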
