# Cleaning and Exploring the Data

This part explores the data and applies the following methods to ensure its integrity:
- **Handle Missing Values:** Techniques include deletion, imputation (filling with mean, median, or mode), or using algorithms that can handle missing data.
- **Correct Inaccuracies:** Identify incorrect data points and correct them manually or automatically.
- **Standardize Data:** Ensure consistency in data formats (e.g., dates, phone numbers).
- **Remove Irrelevant Data:** Exclude data that is not necessary for the analysis to reduce noise.
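
As a rough illustration of these four steps, the sketch below applies them to a small made-up table. The frame `toy`, its columns (`age`, `joined`, `notes`), and the plausible age range of 15 to 100 are all assumptions for the example, not part of the survey.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the survey
toy = pd.DataFrame(
    {
        "age": [23, np.nan, 25, 240],  # one missing value, one implausible value
        "joined": ["9/24/2023", "9/25/2023", "9/25/2023", "9/26/2023"],
        "notes": ["a", "b", "c", "d"],  # not needed for the analysis
    }
)

# Correct inaccuracies: treat ages outside an assumed plausible range as missing
toy.loc[~toy["age"].between(15, 100), "age"] = np.nan

# Handle missing values: impute with the median of the observed ages
toy["age"] = toy["age"].fillna(toy["age"].median())

# Standardize data: parse month/day/year strings into datetimes
toy["joined"] = pd.to_datetime(toy["joined"], format="%m/%d/%Y")

# Remove irrelevant data: drop the column the analysis does not use
toy = toy.drop(columns=["notes"])
```

Median imputation is used here only because the toy column is numeric; the notebook itself fills gaps with the mode.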

In [1]:
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv("oral-health-uofg.csv")
df.head()

Unnamed: 0,طابع زمني,Are you a student at University of Gezira?,Age,Gender,Faculty,Academic level,Causes of dental caries,Causes of bleeding during tooth brushing:,Influence of dental plaque:,Measures that prevent oral diseases:,Systemic diseases that may be related to oral diseases:,Which is more important for oral health: self-administration or dentist?:,Frequency of daily tooth brushing.,Duration of tooth brushing.,Frequency of replacing tooth brush.,Frequency of visiting dentist.,Method of tooth brushing:,Oral hygiene methods besides tooth brushing:,What oral health problem(s) do you have (you can choose more than one):
0,9/24/2023 22:01:24,Yes,23,Male,Medicine,Fourth year,"Toothpaste without fluoride, Frequent ingestio...",Brushing too hard,Affecting appearance,Application of fluoride,Heart diseases,Self-administration of oral hygiene,1,2 to 3 minutes,Every three months,When I have a dental problem/oral disease,Irregular,Mouthwash,Abnormal growth of last molar in the left side...
1,9/24/2023 22:04:16,Yes,25,Male,Medicine,Fifth year,Dynamic of oral micro-flora,Periodontal disease,Affecting appearance,Application of fluoride,Heart diseases,Self-administration of oral hygiene,2,More than 3 minutes,Every three months,Twice a year,Horizontal scrub,Mouthwash,Dental caries
2,9/24/2023 22:05:24,Yes,24,Male,Dentistry,Fourth year,"Frequent ingestion of sugar, Dynamic of oral m...","Natural physiological phenomenon, Brushing too...",Don't know,"Application of fluoride, Pit & fissure sealing","None of the above, Other diseases",Regular visit to dentist,2,2 to 3 minutes,,Twice a year,Modified pass technique,Toothpick,No problem
3,9/24/2023 22:09:19,Yes,25,Male,Medicine,Fifth year,"Frequent ingestion of sugar, Inadequate tooth ...","Periodontal disease, Brushing too hard, System...","Affecting appearance, Inducing dental caries, ...","Application of fluoride, Tooth scaling","Heart diseases, Diabetes mellitus, Hypertension",Self-administration of oral hygiene,1,More than 3 minutes,Every three months,Once a year,Vertical scrub,Mouthwash,Toothache
4,9/24/2023 22:09:26,Yes,20,Male,Medicine,Third year,"Frequent ingestion of sugar, Dynamic of oral m...","Brushing too hard, Systemic disease","Affecting appearance, Inducing dental caries",Application of fluoride,None of the above,Regular visit to dentist,1,2 to 3 minutes,Until it can't be used,When I have a dental problem/oral disease,Horizontal scrub,Sugar-free chewing gum,Toothache


In [3]:
df.reset_index(inplace=True)
df.rename(columns={"index": "RecordID"}, inplace=True)
df.head()

Unnamed: 0,RecordID,طابع زمني,Are you a student at University of Gezira?,Age,Gender,Faculty,Academic level,Causes of dental caries,Causes of bleeding during tooth brushing:,Influence of dental plaque:,Measures that prevent oral diseases:,Systemic diseases that may be related to oral diseases:,Which is more important for oral health: self-administration or dentist?:,Frequency of daily tooth brushing.,Duration of tooth brushing.,Frequency of replacing tooth brush.,Frequency of visiting dentist.,Method of tooth brushing:,Oral hygiene methods besides tooth brushing:,What oral health problem(s) do you have (you can choose more than one):
0,0,9/24/2023 22:01:24,Yes,23,Male,Medicine,Fourth year,"Toothpaste without fluoride, Frequent ingestio...",Brushing too hard,Affecting appearance,Application of fluoride,Heart diseases,Self-administration of oral hygiene,1,2 to 3 minutes,Every three months,When I have a dental problem/oral disease,Irregular,Mouthwash,Abnormal growth of last molar in the left side...
1,1,9/24/2023 22:04:16,Yes,25,Male,Medicine,Fifth year,Dynamic of oral micro-flora,Periodontal disease,Affecting appearance,Application of fluoride,Heart diseases,Self-administration of oral hygiene,2,More than 3 minutes,Every three months,Twice a year,Horizontal scrub,Mouthwash,Dental caries
2,2,9/24/2023 22:05:24,Yes,24,Male,Dentistry,Fourth year,"Frequent ingestion of sugar, Dynamic of oral m...","Natural physiological phenomenon, Brushing too...",Don't know,"Application of fluoride, Pit & fissure sealing","None of the above, Other diseases",Regular visit to dentist,2,2 to 3 minutes,,Twice a year,Modified pass technique,Toothpick,No problem
3,3,9/24/2023 22:09:19,Yes,25,Male,Medicine,Fifth year,"Frequent ingestion of sugar, Inadequate tooth ...","Periodontal disease, Brushing too hard, System...","Affecting appearance, Inducing dental caries, ...","Application of fluoride, Tooth scaling","Heart diseases, Diabetes mellitus, Hypertension",Self-administration of oral hygiene,1,More than 3 minutes,Every three months,Once a year,Vertical scrub,Mouthwash,Toothache
4,4,9/24/2023 22:09:26,Yes,20,Male,Medicine,Third year,"Frequent ingestion of sugar, Dynamic of oral m...","Brushing too hard, Systemic disease","Affecting appearance, Inducing dental caries",Application of fluoride,None of the above,Regular visit to dentist,1,2 to 3 minutes,Until it can't be used,When I have a dental problem/oral disease,Horizontal scrub,Sugar-free chewing gum,Toothache


In [4]:
df = df[
    df["Are you a student at University of Gezira?"] == "Yes"
]  # Keep only respondents who are students at UofG

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 401 entries, 0 to 407
Data columns (total 20 columns):
 #   Column                                                                     Non-Null Count  Dtype 
---  ------                                                                     --------------  ----- 
 0   RecordID                                                                   401 non-null    int64 
 1   طابع زمني                                                                  401 non-null    object
 2   Are you a student at University of Gezira?                                 401 non-null    object
 3   Age                                                                        396 non-null    object
 4   Gender                                                                     400 non-null    object
 5   Faculty                                                                    401 non-null    object
 6   Academic level                                                         

Most columns have only a few missing values, so these gaps can be filled with the column mode. The exception is "Oral hygiene methods besides tooth brushing:", where a blank answer probably means "I use nothing else", so its missing values will be filled with "None".

In [6]:
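def clean_age(val):
    """Convert a raw Age entry to an int; return NaN if it cannot be parsed."""
    try:
        return int(val)  # works if val is already a number or numeric string
    except (TypeError, ValueError, OverflowError):
        try:
            return int(val[:2])  # fall back to the first two characters
        except (TypeError, ValueError):
            return np.nan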


df["Age"] = df["Age"].apply(clean_age)
print(df["Age"])

0      23.0
1      25.0
2      24.0
3      25.0
4      20.0
       ... 
403    24.0
404    24.0
405    25.0
406    25.0
407    24.0
Name: Age, Length: 401, dtype: float64
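
The slice-based fallback above relies on the age sitting in the first two characters of the entry. A possible alternative, sketched here under the assumption that free-text answers may put the digits elsewhere, is to pull out the first run of digits with a regular expression. The helper name `extract_age` is hypothetical and is not used elsewhere in the notebook.

```python
import re

import numpy as np


def extract_age(val):
    """Return the first run of digits in val as an int, or NaN if there is none."""
    match = re.search(r"\d+", str(val))
    return int(match.group()) if match else np.nan


# Illustrative inputs only; these strings are not taken from the survey
print([extract_age(v) for v in ["23", "23 years", "Age: 24", None]])
# prints [23, 23, 24, nan]
```

Either approach still benefits from a range check afterwards, since any digit string is accepted as an age.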


In [7]:
df.isna().sum()

RecordID                                                                      0
طابع زمني                                                                     0
Are you a student at University of Gezira?                                    0
Age                                                                           6
Gender                                                                        1
Faculty                                                                       0
Academic level                                                                1
Causes of dental caries                                                       3
Causes of bleeding during tooth brushing:                                     5
Influence of dental plaque:                                                   8
Measures that prevent oral diseases:                                          8
Systemic diseases that may be related to oral diseases:                       7
Which is more important for oral health:

In [8]:
import statistics

for col in df.columns:
    n_missing = df[col].isna().sum()
    if 0 < n_missing < 20:
        # Fill sparse gaps with the mode of the observed (non-missing) values
        df.fillna({col: statistics.mode(df[col].dropna())}, inplace=True)

In [9]:
df.isna().sum()

RecordID                                                                      0
طابع زمني                                                                     0
Are you a student at University of Gezira?                                    0
Age                                                                           0
Gender                                                                        0
Faculty                                                                       0
Academic level                                                                0
Causes of dental caries                                                       0
Causes of bleeding during tooth brushing:                                     0
Influence of dental plaque:                                                   0
Measures that prevent oral diseases:                                          0
Systemic diseases that may be related to oral diseases:                       0
Which is more important for oral health:

In [10]:
df.fillna({"Oral hygiene methods besides tooth brushing:": "None"}, inplace=True)
df.isna().sum()  # cleaning is done

RecordID                                                                     0
طابع زمني                                                                    0
Are you a student at University of Gezira?                                   0
Age                                                                          0
Gender                                                                       0
Faculty                                                                      0
Academic level                                                               0
Causes of dental caries                                                      0
Causes of bleeding during tooth brushing:                                    0
Influence of dental plaque:                                                  0
Measures that prevent oral diseases:                                         0
Systemic diseases that may be related to oral diseases:                      0
Which is more important for oral health: self-admini

In [11]:
df.rename(columns={"طابع زمني": "Timestamp"}, inplace=True)
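
After the rename, the `Timestamp` column is still stored as text. If it is needed for analysis, it could be parsed into a datetime type. The sketch below uses two values copied from the preview earlier and assumes the whole column follows the same month/day/year layout with a 24-hour clock.

```python
import pandas as pd

# Two sample values copied from the preview of the timestamp column
timestamps = pd.Series(["9/24/2023 22:01:24", "9/24/2023 22:04:16"])

# Assumed format: month/day/year hour:minute:second;
# errors="coerce" turns entries that do not match into NaT instead of raising
parsed = pd.to_datetime(timestamps, format="%m/%d/%Y %H:%M:%S", errors="coerce")
print(parsed.dtype)
```

On the real frame, the same call would take `df["Timestamp"]` as input, and counting the `NaT` values afterwards would show how many entries did not match the assumed format.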

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 401 entries, 0 to 407
Data columns (total 20 columns):
 #   Column                                                                     Non-Null Count  Dtype  
---  ------                                                                     --------------  -----  
 0   RecordID                                                                   401 non-null    int64  
 1   Timestamp                                                                  401 non-null    object 
 2   Are you a student at University of Gezira?                                 401 non-null    object 
 3   Age                                                                        401 non-null    float64
 4   Gender                                                                     401 non-null    object 
 5   Faculty                                                                    401 non-null    object 
 6   Academic level                                                 

In [13]:
df["Academic level"].value_counts()

Academic level
Third year     108
Fifth year      92
Fourth year     86
Second year     61
First year      54
Name: count, dtype: int64

Applying the inclusion and exclusion criteria would cut the sample from 401 to 254 respondents, a reduction large enough to threaten the following:

- Statistical power (increased margin of error)
- Generalizability
- Validity of between-group comparisons

All academic levels will be kept for now.

Note to researchers: be transparent about this change in the methodology section and explain it as a trade-off made to preserve sample integrity.
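
To put the loss of precision in numbers, the sketch below compares the approximate 95% margin of error for a proportion at the two sample sizes. It assumes simple random sampling and the worst case `p = 0.5`, neither of which is established for this survey, so the figures are indicative only. The function name `margin_of_error` is introduced here for illustration.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion: z * sqrt(p * (1 - p) / n)."""
    return z * math.sqrt(p * (1 - p) / n)


for n in (401, 254):
    print(n, round(100 * margin_of_error(n), 1))
# prints:
# 401 4.9
# 254 6.1
```

Under these assumptions, dropping to 254 respondents widens the margin of error from about 4.9 to about 6.1 percentage points.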

In [14]:
df["Gender"].value_counts()

Gender
Female    306
Male       95
Name: count, dtype: int64

In [15]:
df["Faculty"].value_counts()

Faculty
Medicine     341
Dentistry     60
Name: count, dtype: int64

The sample is 76% female and 24% male, and 85% of respondents are from Medicine versus 15% from Dentistry.
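
These shares follow directly from the counts above. The short check below rebuilds the counts by hand so that it runs on its own; on the real frame, `df["Gender"].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Counts copied from the value_counts() outputs above
gender = pd.Series({"Female": 306, "Male": 95})
faculty = pd.Series({"Medicine": 341, "Dentistry": 60})

# Convert counts to percentages of the 401 respondents
gender_pct = (100 * gender / gender.sum()).round(1)
faculty_pct = (100 * faculty / faculty.sum()).round(1)
print(gender_pct)
print(faculty_pct)
```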

The study should acknowledge that this sampling imbalance may introduce bias. Consequently, the findings, particularly those related to gender- or faculty-specific trends, should be interpreted with caution.

The smaller representation of dentistry students, for instance, may limit the generalizability of any comparisons made between faculties. Future studies are encouraged to adopt more stratified or randomized sampling approaches to ensure more balanced representation.
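
As a minimal sketch of what a stratified draw could look like, the example below samples the same number of students from each faculty. The sampling frame `frame`, its size, and the per-faculty quota of 100 are entirely hypothetical; a real study would build the frame from enrollment records and choose the allocation deliberately.

```python
import pandas as pd

# Hypothetical sampling frame: one row per enrolled student
frame = pd.DataFrame(
    {
        "student_id": range(1000),
        "Faculty": ["Medicine"] * 800 + ["Dentistry"] * 200,
    }
)

# Equal allocation: draw 100 students at random within each faculty
sample = frame.groupby("Faculty").sample(n=100, random_state=0)
print(sample["Faculty"].value_counts())
```

Equal allocation is only one option; proportional allocation (passing `frac=` instead of `n=`) would keep each faculty's share of the population instead.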

In [16]:
df.to_csv("oral-health-uofg-cleaned.csv", index=False)