# Cleaning

### 1-Dataset Sample:

In [43]:
import pandas as pd
survey_data = pd.read_csv('SurveyData.csv')


print("Sample Data:")
print(survey_data.head())


Sample Data:
  the age:   Gender:   Area: Current educational level: marital status:  \
0    13-17  feminine  Riyadh  High school or equivalent        bachelor   
1    18-24      male  Riyadh          Bachelor's degree        bachelor   
2    18-24  feminine  Riyadh          Bachelor's degree        bachelor   
3    18-24  feminine  Riyadh  High school or equivalent        bachelor   
4    35-44  feminine  Riyadh          Bachelor's degree         married   

      Employment status: Do you use social media applications?  \
0                student                                   Yes   
1           Not employed                                   Yes   
2                student                                   Yes   
3                student                                   Yes   
4  Housewife, unemployed                                   Yes   

             What social media platforms do you use?  \
0  Instagram, X (Twitter), TikTok, Snapchat, Yout...   
1  Instagram, X (Twitter), 

### 2-Ignoring the Last Two Columns:

In this step, we drop the last two columns for now because they will be handled later, in the preprocessing stage, with techniques such as sentiment analysis and tokenization.


In [44]:
survey_data = survey_data.drop(survey_data.columns[-2:], axis=1)

print("Data after ignoring (deleting) the last two columns:")
print(survey_data.head())

Data after ignoring (deleting) the last two columns:
  the age:   Gender:   Area: Current educational level: marital status:  \
0    13-17  feminine  Riyadh  High school or equivalent        bachelor   
1    18-24      male  Riyadh          Bachelor's degree        bachelor   
2    18-24  feminine  Riyadh          Bachelor's degree        bachelor   
3    18-24  feminine  Riyadh  High school or equivalent        bachelor   
4    35-44  feminine  Riyadh          Bachelor's degree         married   

      Employment status: Do you use social media applications?  \
0                student                                   Yes   
1           Not employed                                   Yes   
2                student                                   Yes   
3                student                                   Yes   
4  Housewife, unemployed                                   Yes   

             What social media platforms do you use?  \
0  Instagram, X (Twitter), TikTok, Snapchat
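
Because the drop overwrites `survey_data`, the two columns are no longer in memory afterwards. One option, sketched here on toy data, is to copy them aside before dropping. The frame `toy` and the column names `free_text_1` and `free_text_2` are hypothetical stand-ins, not the survey's real column names.

```python
import pandas as pd

# Toy frame standing in for the survey; the last two column names are hypothetical.
toy = pd.DataFrame({
    "Gender:": ["male", "feminine"],
    "Area:": ["Riyadh", "Abha"],
    "free_text_1": ["too much scrolling", "helps me relax"],
    "free_text_2": ["none", "less screen time"],
})

# Copy the last two columns first so they stay available for later text processing.
text_columns = toy.iloc[:, -2:].copy()

# Drop the same two columns from the structured part of the data.
structured = toy.drop(columns=toy.columns[-2:])

print(structured.columns.tolist())
print(text_columns.columns.tolist())
```

Both frames keep the same row index, so the text columns can be joined back later if needed.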

### 3-Deleting Unnecessary Rows:

In this step, we delete all rows where the answer to "Do you use social media applications?" is 'No', as these responses are not relevant to our analysis.

In [45]:
survey_data = survey_data[survey_data["Do you use social media applications?"] != "No"]
print("Data after deleting rows with 'No' in the 'Do you use social media applications?' column:")
social_media_answers_sample = survey_data["Do you use social media applications?"].head(10)
print(social_media_answers_sample)

Data after deleting rows with 'No' in the 'Do you use social media applications?' column:
0    Yes
1    Yes
2    Yes
3    Yes
4    Yes
5    Yes
6    Yes
7    Yes
8    Yes
9    Yes
Name: Do you use social media applications?, dtype: object
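
The comparison above matches the exact string "No". If the answers might contain stray spaces or different capitalization (an assumption, since the notebook does not show such cases), a slightly more defensive filter could look like the sketch below. It runs on toy data, and resetting the index is optional; the notebook itself keeps the original row labels.

```python
import pandas as pd

col = "Do you use social media applications?"

# Toy answers, including variants with extra spaces and different case.
toy = pd.DataFrame({col: ["Yes", "No", " no ", "Yes", "NO"]})

# Normalize whitespace and case before comparing.
is_no = toy[col].str.strip().str.lower().eq("no")

# Keep every row that is not a "no", and renumber the rows from 0.
kept = toy[~is_no].reset_index(drop=True)

print(f"Dropped {is_no.sum()} of {len(toy)} rows")
print(kept)
```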


### 4-Checking missing values:

In [46]:

missing_values = survey_data.isnull().sum()

print("Missing values in each column:")
print(missing_values)

if missing_values.sum() == 0:
    print("\nThere are no missing values in the dataset.")
else:
    print("\nThere are missing values in the dataset. Please review the above output for details.")


Missing values in each column:
the age:                                                                                                                                                               0
Gender:                                                                                                                                                                0
Area:                                                                                                                                                                  0
Current educational level:                                                                                                                                             0
marital status:                                                                                                                                                        0
Employment status:                                                                                                          
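
Note that `isnull()` only counts true missing values (`None`/`NaN`). Placeholder text such as an empty string or "#VALUE!" is an ordinary string to pandas and is not counted. The sketch below shows the difference on toy data; the `placeholders` list is a hypothetical choice and should be adapted to what the data actually contains.

```python
import pandas as pd

toy = pd.DataFrame({
    "Gender:": ["male", None, "feminine", "#VALUE!"],
    "Area:": ["Riyadh", "Abha", "", "Jeddah"],
})

# isnull() flags only real missing values, not placeholder text.
true_missing = toy.isnull().sum()

# Hypothetical list of placeholder strings to count separately.
placeholders = ["", "#VALUE!"]
placeholder_counts = toy.isin(placeholders).sum()

print(true_missing)
print(placeholder_counts)
```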

### 5- Dealing with "#VALUE!" 

Now we check whether the value "#VALUE!" occurs in the dataset, as it marks unanswered questions in the survey.

In [47]:
import pandas as pd

# Count every cell that holds the "#VALUE!" marker, across all columns
value_error_count = (survey_data == "#VALUE!").sum().sum()

if value_error_count > 0:
    print(f"Found {value_error_count} occurrences of '#VALUE!' in the dataset. Deleting them now...")
    # Turn the marker into a missing value, then drop every row that contains one
    survey_data = survey_data.replace("#VALUE!", pd.NA).dropna()
    print("Rows with '#VALUE!' have been deleted.")
else:
    print("No occurrences of '#VALUE!' found in the dataset.")

print("\nUpdated data sample after cleaning:")
print(survey_data.head())

Found 156 occurrences of '#VALUE!' in the dataset. Deleting them now...
Rows with '#VALUE!' have been deleted.

Updated data sample after cleaning:
  the age:   Gender:   Area: Current educational level: marital status:  \
0    13-17  feminine  Riyadh  High school or equivalent        bachelor   
1    18-24      male  Riyadh          Bachelor's degree        bachelor   
2    18-24  feminine  Riyadh          Bachelor's degree        bachelor   
3    18-24  feminine  Riyadh  High school or equivalent        bachelor   
4    35-44  feminine  Riyadh          Bachelor's degree         married   

      Employment status: Do you use social media applications?  \
0                student                                   Yes   
1           Not employed                                   Yes   
2                student                                   Yes   
3                student                                   Yes   
4  Housewife, unemployed                                   Yes   
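
The figure reported above counts individual cells, not rows: a single row with several unanswered questions contributes several occurrences. A minimal sketch on toy data (the columns `q1` and `q2` are hypothetical) shows how to count both and how many rows survive.

```python
import pandas as pd

toy = pd.DataFrame({
    "q1": ["always", "#VALUE!", "never", "#VALUE!"],
    "q2": ["often", "#VALUE!", "rarely", "sometimes"],
})

# Boolean frame: True wherever a cell holds the marker.
mask = toy.eq("#VALUE!")

cell_count = int(mask.sum().sum())       # individual cells holding the marker
row_count = int(mask.any(axis=1).sum())  # rows with at least one marker

# Keep only the rows that contain no marker at all.
cleaned = toy[~mask.any(axis=1)]

print(cell_count, row_count, len(cleaned))
```

Comparing `len(survey_data)` before and after the cleaning step gives the same information for the real data.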

    

# Preprocessing

### Sample Dataset

In [48]:
print("Sample Data:")
print(survey_data.head())

Sample Data:
  the age:   Gender:   Area: Current educational level: marital status:  \
0    13-17  feminine  Riyadh  High school or equivalent        bachelor   
1    18-24      male  Riyadh          Bachelor's degree        bachelor   
2    18-24  feminine  Riyadh          Bachelor's degree        bachelor   
3    18-24  feminine  Riyadh  High school or equivalent        bachelor   
4    35-44  feminine  Riyadh          Bachelor's degree         married   

      Employment status: Do you use social media applications?  \
0                student                                   Yes   
1           Not employed                                   Yes   
2                student                                   Yes   
3                student                                   Yes   
4  Housewife, unemployed                                   Yes   

             What social media platforms do you use?  \
0  Instagram, X (Twitter), TikTok, Snapchat, Yout...   
1  Instagram, X (Twitter), 

### 1- Data Transformation 

Here, we identify and replace inconsistent or incorrect values in the dataset. For example, we replace "feminine" with "Female" in the Gender column and correct city names in the Area column: "grandmother" becomes "Jeddah", "the news" becomes "Khobar", and "City" becomes "Madinah".



In [49]:


# Replace "feminine" with "Female" in the Gender column
survey_data['Gender:'] = survey_data['Gender:'].replace("feminine", "Female")

# Replace incorrect city names in the Area: column
survey_data['Area:'] = survey_data['Area:'].replace({
    "grandmother": "Jeddah", 
    "the news": "Khobar", 
    "City": "Madinah"
})

# Display a sample of the updated dataset
print("\nUpdated data sample after replacing incorrect words:")
print(survey_data[['Gender:', 'Area:']].head(250))



Updated data sample after replacing incorrect words:
    Gender:   Area:
0    Female  Riyadh
1      male  Riyadh
2    Female  Riyadh
3    Female  Riyadh
4    Female  Riyadh
..      ...     ...
247  Female  Riyadh
248  Female  Jeddah
249  Female  Riyadh
250    male  Riyadh
251  Female    Abha

[250 rows x 2 columns]
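
Rather than scanning 250 printed rows, the result of the replacements can be checked by listing the distinct values that remain in each column. The sketch below does this on toy data. It can also reveal leftover inconsistencies, such as "male" in lowercase next to "Female".

```python
import pandas as pd

toy = pd.DataFrame({
    "Gender:": ["feminine", "male", "feminine"],
    "Area:": ["grandmother", "Riyadh", "the news"],
})

# Same replacements as in the cell above, applied to the toy frame.
toy["Gender:"] = toy["Gender:"].replace("feminine", "Female")
toy["Area:"] = toy["Area:"].replace({
    "grandmother": "Jeddah",
    "the news": "Khobar",
    "City": "Madinah",
})

# value_counts() lists every remaining distinct value with its frequency.
print(toy["Gender:"].value_counts())
print(toy["Area:"].value_counts())
```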


### 2- Range(1-5):

Now we convert the survey responses to a numerical scale from 1 to 5, where 1 corresponds to "always" and 5 to "never". This puts all ten questions on the same scale, ready for analysis.
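
One thing worth checking with such a conversion is whether every answer text has an entry in the mapping, since `replace` silently leaves unknown text unchanged. The sketch below is an alternative on toy data, not the notebook's method: it uses `Series.map` with lowercase keys, so unmapped answers become `NaN` and are easy to list. The answer "Maybe" is made up for illustration.

```python
import pandas as pd

# Lowercase keys, so one entry covers both "Rarely" and "rarely".
scale = {
    "yes, always": 1, "always": 1,
    "yes, a lot": 2, "often": 2,
    "sometimes": 3,
    "rarely": 4,
    "no, never": 5, "never": 5,
}

# Toy answers; "Maybe" is a made-up value with no entry in the mapping.
answers = pd.Series(["Yes, always", "Rarely", "sometimes", "Maybe"])

# Normalize the text, then map it; text without a mapping becomes NaN.
normalized = answers.str.strip().str.lower()
scores = normalized.map(scale)

# List the original answers that could not be converted.
unmapped = answers[scores.isna()]

print(scores.tolist())
print(unmapped.tolist())
```

If `unmapped` is empty, every response was converted and the column can safely be cast to an integer type.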

In [51]:

columns_to_convert = [
    "Do you feel anxious or stressed after reading negative comments on your posts?",
    "Are you worried about missing out on important information or events when you're not using social media?", 
    "Do you feel that using social media has affected your ability to focus and accomplish daily tasks?", 
    "Do you think that consuming quick content (such as watching short videos and push notifications...) has affected your patience and ability to deal with long tasks?", 
    "Do you use social media right before going to sleep?", 
    "Do you have difficulty sleeping because of thinking about what you saw on social media platforms?", 
    "Does the number of likes or comments you get on your posts affect you?", 
    "Have you changed your opinion or feeling based on the reactions of others on social media platforms?", 
    "Do you prefer interacting with friends or family online rather than face-to-face?", 
    "How often do you find yourself using social media for longer than you planned?"
]

response_mapping = {
    "Yes, always": 1,
    "always": 1,
    "Yes, a lot": 2,
    "often": 2,
    "sometimes": 3,
    "Rarely": 4,
    "rarely": 4,
    "No, never": 5,
    "never": 5
}

for column in columns_to_convert:
    survey_data[column] = survey_data[column].replace(response_mapping)

print("\nUpdated data sample with answers converted to range 1-5:")
print(survey_data[columns_to_convert].head(10))



Updated data sample with answers converted to range 1-5:
  Do you feel anxious or stressed after reading negative comments on your posts?  \
0                                                  4                               
1                                                  5                               
2                                                  1                               
3                                                  3                               
4                                                  5                               
5                                                  3                               
6                                                  1                               
7                                                  5                               
8                                                  4                               
9                                                  3                               

   Are you worrie