### ***Project Overview***

*You are a data engineer at Kara Solutions, a leading data science company with more than 50 data-centric solutions. You are tasked with building a data warehouse to store data on Ethiopian medical businesses scraped from the web and Telegram channels. The project involves several key steps and considerations to ensure that the warehouse is robust, scalable, and able to handle the particular challenges of scraping and collecting data from Telegram channels. It also integrates object detection with YOLO (You Only Look Once) to enhance data analysis.*

***This section of the project cleans the scraped Telegram data and loads the cleaned data into a local PostgreSQL database.***

In [1]:
## Import required libraries
import pandas as pd
import numpy as np
import sys
import os
# Add the scripts directory (a path relative to this notebook) to the import path
sys.path.append('../scripts')

In [2]:
# import the modules
from data_cleaning import load_csv, clean_dataframe, save_cleaned_data
from database_setup import get_db_connection, create_table, insert_data

In [None]:
## Load the data scraped from Telegram
df = load_csv('../data/telegram_data.csv')
df.head()

2025-01-30 15:22:52,559 - INFO - CSV file '../data/telegram_data.csv' loaded successfully.


Unnamed: 0,Channel Title,Channel Username,ID,Message,Date,Media Path
0,Doctors Ethiopia,@DoctorsET,864,https://youtu.be/5DBoEm-8kmA?si=LDLuEecNfULJVD...,2023-12-18 17:04:02+00:00,
1,Doctors Ethiopia,@DoctorsET,863,ዶክተርስ ኢትዮጵያ በ አዲስ አቀራረብ በ ቴሌቪዥን ፕሮግራሙን ለመጀመር ከ...,2023-11-03 16:14:39+00:00,
2,Doctors Ethiopia,@DoctorsET,862,ሞት በስኳር \n\nለልጆቻችን የምናሲዘው ምሳቃ ሳናቀው እድሚያቸውን ይቀን...,2023-10-02 16:37:39+00:00,
3,Doctors Ethiopia,@DoctorsET,861,ከ HIV የተፈወሰ ሰው አጋጥሟችሁ ያቃል ? ፈውስ እና ህክምና ?\n\nሙ...,2023-09-16 07:54:32+00:00,
4,Doctors Ethiopia,@DoctorsET,860,በቅርብ ጊዜ በሃገራችን ላይ እየተስተዋለ ያለ የተመሳሳይ ፆታ ( Homos...,2023-09-01 16:16:15+00:00,


#### ***Clean and Standardize The Data***

In [4]:
## Clean the data
df_cleaned = clean_dataframe(df)

# Display cleaned dataset
df_cleaned.head(10)

2025-01-30 15:22:52,583 - INFO - Duplicates removed from dataset.
2025-01-30 15:22:52,604 - INFO - Date column formatted to datetime.
2025-01-30 15:22:52,621 - INFO - Missing values filled.
2025-01-30 15:22:52,684 - INFO - Text columns standardized.
2025-01-30 15:22:52,804 - INFO - Emojis extracted and stored in 'emoji_used' column.
2025-01-30 15:22:53,066 - INFO - YouTube links extracted and stored in 'youtube_links' column.
2025-01-30 15:22:53,067 - INFO - Data cleaning completed successfully.


Unnamed: 0,channel_title,channel_username,message_id,message,message_date,media_path,emoji_used,youtube_links
0,Doctors Ethiopia,@DoctorsET,864,"በቀን አንዴ ብቻ የሚባለው የቢዝነስ አማካሪ በ 10,000 ብር ብቻ የተ...",2023-12-18 17:04:02+00:00,No Media,👈👈👇👇,https://youtu.be/5DBoEm-8kmA?si=LDLuEecNfULJVD...
1,Doctors Ethiopia,@DoctorsET,863,ዶክተርስ ኢትዮጵያ በ አዲስ አቀራረብ በ ቴሌቪዥን ፕሮግራሙን ለመጀመር ከ...,2023-11-03 16:14:39+00:00,No Media,👇,https://youtu.be/gwVN5eJQpko?si=xARsSxIEdZtE91GY
2,Doctors Ethiopia,@DoctorsET,862,ሞት በስኳር ለልጆቻችን የምናሲዘው ምሳቃ ሳናቀው እድሚያቸውን ይቀንሰው ...,2023-10-02 16:37:39+00:00,No Media,No emoji,https://youtu.be/oHiSRrNF7I0?si=Absgm414YSt_kjNq
3,Doctors Ethiopia,@DoctorsET,861,ከ HIV የተፈወሰ ሰው አጋጥሟችሁ ያቃል ? ፈውስ እና ህክምና ? ሙሉ ቪ...,2023-09-16 07:54:32+00:00,No Media,👇👇👇👇,https://youtu.be/tTeErZxIh_Q?si=jKHyfWcC3sfXbC8L
4,Doctors Ethiopia,@DoctorsET,860,በቅርብ ጊዜ በሃገራችን ላይ እየተስተዋለ ያለ የተመሳሳይ ፆታ ( Homos...,2023-09-01 16:16:15+00:00,No Media,No emoji,https://youtu.be/0k65P5ouw7s?si=qaUgo75bUa3AMQxD
5,Doctors Ethiopia,@DoctorsET,859,ዶክተርስ ኢትዮጽያ በአዲስ ፕሮገራም ጀመረ ማረጥ (ሜኖፖዝ ) ጋር ተያይ...,2023-08-29 17:20:05+00:00,No Media,👇👇👇👇👇👇,https://youtu.be/-AR1KO2DbFw?si=47cXLZtlmhx1Nl...
6,Doctors Ethiopia,@DoctorsET,848,ክረምቱን ስፖርት መስራት አስበው ጂም ለመግባት ካልቻሉ ባሉበት ቦታ ሆነው...,2022-08-02 17:42:08+00:00,No Media,👇👇👇👇👇,https://youtu.be/0uiTzjEbh90
7,Doctors Ethiopia,@DoctorsET,847,ስፖርት የመስራት ሱስ ይኖር ይሆን? በአሁኑ ወቅት ብዙ የስፖርት መስሪያ ...,2022-06-12 17:15:47+00:00,No Media,👇👇👇👇👇👇,https://youtu.be/WPlRuRtQXN8
8,Doctors Ethiopia,@DoctorsET,846,ድንገተኛ አደጋ / የአጥንት ስብራት አያርገውና ድንገተኛ የሆነ አደጋ ቢደ...,2022-05-31 17:51:13+00:00,No Media,👇👇👇👇👇👇👇,https://youtu.be/QI-8oqW80uI
9,Doctors Ethiopia,@DoctorsET,845,ከትንሽ ግዚያት በፊት ስፖርት መስራት እንደ ቅንጦት ይታይ ነበር አሁን ላ...,2022-05-20 18:04:53+00:00,No Media,👇👇👇👇👇👇,https://youtu.be/_IEWt07bECg
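
The cleaning log above reports that emojis and YouTube links are extracted from each message into the `emoji_used` and `youtube_links` columns. The implementation of `clean_dataframe` is not shown in this notebook, so the following is only a minimal sketch of how such extraction could be done with regular expressions. The function names, the emoji ranges, and the `No YouTube link` placeholder are assumptions made for illustration; only the `No emoji` placeholder appears in the output above.

```python
import re

# Approximate emoji ranges (pictographs, emoticons, transport, dingbats).
# This is an illustrative subset, not a complete Unicode emoji definition.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Matches youtu.be short links and youtube.com URLs up to the next whitespace.
YOUTUBE_RE = re.compile(r"https?://(?:www\.)?(?:youtu\.be|youtube\.com)/\S+")


def extract_emojis(text: str) -> str:
    """Return all emojis found in `text`, in order, or 'No emoji' if none."""
    found = EMOJI_RE.findall(text)
    return "".join(found) if found else "No emoji"


def extract_youtube_links(text: str) -> str:
    """Return comma-separated YouTube URLs, or a placeholder if none are found."""
    found = YOUTUBE_RE.findall(text)
    # 'No YouTube link' is a hypothetical placeholder chosen for this sketch.
    return ", ".join(found) if found else "No YouTube link"


sample = "ስፖርት 👇👇 https://youtu.be/0uiTzjEbh90"
print(extract_emojis(sample))
print(extract_youtube_links(sample))
```

On a DataFrame, helpers like these would typically be applied column-wise, for example `df["emoji_used"] = df["message"].apply(extract_emojis)`.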


In [5]:
# Check for missing values in the cleaned DataFrame
missing_values = df_cleaned.isnull().sum()
missing_values[missing_values > 0]  # Display only columns with missing values

Series([], dtype: int64)

In [6]:
# Save cleaned data to CSV
save_cleaned_data(df_cleaned, "../data/cleaned_telegram_data.csv")

2025-01-30 15:22:53,209 - INFO - Cleaned data saved successfully to '../data/cleaned_telegram_data.csv'.


Cleaned data saved successfully to '../data/cleaned_telegram_data.csv'.


#### ***Load into the PostgreSQL Database***

In [9]:
## Establish a connection to the database
engine = get_db_connection()

2025-01-30 15:23:54,726 - INFO - Successfully connected to the PostgreSQL database.
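
`get_db_connection` is defined in `database_setup` and is not shown here. One common way to write such a helper is to read the credentials from environment variables and assemble a connection URL, which keeps passwords out of the notebook. The sketch below covers only the URL-building step. The variable names (`DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_PORT`, `DB_NAME`), the defaults, and the example values are hypothetical.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=None) -> str:
    """Build a SQLAlchemy-style PostgreSQL URL from environment variables."""
    env = os.environ if env is None else env
    user = env.get("DB_USER", "postgres")
    # URL-encode the password so characters such as '@' do not break the URL.
    password = quote_plus(env.get("DB_PASSWORD", ""))
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "5432")
    name = env.get("DB_NAME", "postgres")
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{name}"


# Example with made-up credentials passed as a plain dict instead of os.environ.
url = build_postgres_url({"DB_USER": "kara", "DB_PASSWORD": "p@ss", "DB_NAME": "medical_dw"})
print(url)
```

If the project uses SQLAlchemy, a URL of this form could be passed to `sqlalchemy.create_engine(url)` to obtain the engine used in the following cells.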


In [10]:
## Create a table in the database
create_table(engine)

2025-01-30 15:24:18,643 - INFO - Table 'telegram_messages' created successfully.


In [11]:
# Load the cleaned CSV into a DataFrame
cleaned_df = pd.read_csv("../data/cleaned_telegram_data.csv")

In [12]:
# Parse 'message_date' as datetime; errors="coerce" turns unparseable values into NaT
cleaned_df["message_date"] = pd.to_datetime(cleaned_df["message_date"], errors="coerce")

# Check if there are any missing values before inserting
missing_values = cleaned_df.isnull().sum()
print("Missing Values Before Insert:", missing_values)

Missing Values Before Insert: channel_title       0
channel_username    0
message_id          0
message             1
message_date        0
media_path          0
emoji_used          0
youtube_links       0
dtype: int64


In [13]:
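## Drop rows with missing values. One 'message' is null after the CSV round trip,
## likely an empty string that pandas reads back as NaN.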
cleaned_df.dropna(inplace=True)

In [14]:
# Preview the first few rows of the cleaned DataFrame
cleaned_df.head()

Unnamed: 0,channel_title,channel_username,message_id,message,message_date,media_path,emoji_used,youtube_links
0,Doctors Ethiopia,@DoctorsET,864,"በቀን አንዴ ብቻ የሚባለው የቢዝነስ አማካሪ በ 10,000 ብር ብቻ የተ...",2023-12-18 17:04:02+00:00,No Media,👈👈👇👇,https://youtu.be/5DBoEm-8kmA?si=LDLuEecNfULJVD...
1,Doctors Ethiopia,@DoctorsET,863,ዶክተርስ ኢትዮጵያ በ አዲስ አቀራረብ በ ቴሌቪዥን ፕሮግራሙን ለመጀመር ከ...,2023-11-03 16:14:39+00:00,No Media,👇,https://youtu.be/gwVN5eJQpko?si=xARsSxIEdZtE91GY
2,Doctors Ethiopia,@DoctorsET,862,ሞት በስኳር ለልጆቻችን የምናሲዘው ምሳቃ ሳናቀው እድሚያቸውን ይቀንሰው ...,2023-10-02 16:37:39+00:00,No Media,No emoji,https://youtu.be/oHiSRrNF7I0?si=Absgm414YSt_kjNq
3,Doctors Ethiopia,@DoctorsET,861,ከ HIV የተፈወሰ ሰው አጋጥሟችሁ ያቃል ? ፈውስ እና ህክምና ? ሙሉ ቪ...,2023-09-16 07:54:32+00:00,No Media,👇👇👇👇,https://youtu.be/tTeErZxIh_Q?si=jKHyfWcC3sfXbC8L
4,Doctors Ethiopia,@DoctorsET,860,በቅርብ ጊዜ በሃገራችን ላይ እየተስተዋለ ያለ የተመሳሳይ ፆታ ( Homos...,2023-09-01 16:16:15+00:00,No Media,No emoji,https://youtu.be/0k65P5ouw7s?si=qaUgo75bUa3AMQxD


In [15]:
# Insert into the database
insert_data(engine, cleaned_df)

2025-01-30 15:27:55,584 - INFO - Inserting: 864 - 2023-12-18 17:04:02+00:00
2025-01-30 15:27:55,593 - INFO - Inserting: 863 - 2023-11-03 16:14:39+00:00
2025-01-30 15:27:55,595 - INFO - Inserting: 862 - 2023-10-02 16:37:39+00:00
2025-01-30 15:27:55,597 - INFO - Inserting: 861 - 2023-09-16 07:54:32+00:00
2025-01-30 15:27:55,597 - INFO - Inserting: 860 - 2023-09-01 16:16:15+00:00
2025-01-30 15:27:55,601 - INFO - Inserting: 859 - 2023-08-29 17:20:05+00:00
2025-01-30 15:27:55,605 - INFO - Inserting: 848 - 2022-08-02 17:42:08+00:00
2025-01-30 15:27:55,606 - INFO - Inserting: 847 - 2022-06-12 17:15:47+00:00
2025-01-30 15:27:55,606 - INFO - Inserting: 846 - 2022-05-31 17:51:13+00:00
2025-01-30 15:27:55,611 - INFO - Inserting: 845 - 2022-05-20 18:04:53+00:00
2025-01-30 15:27:55,614 - INFO - Inserting: 844 - 2022-05-15 15:59:10+00:00
2025-01-30 15:27:55,614 - INFO - Inserting: 843 - 2022-05-07 18:22:14+00:00
2025-01-30 15:27:55,614 - INFO - Inserting: 842 - 2022-05-06 17:51:05+00:00
2025-01-30 1
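
The log above shows rows being inserted one at a time. The bodies of `create_table` and `insert_data` are not shown, so the following is only a sketch of the create-then-insert pattern with a parameterized statement and a single transaction. It uses the standard library's `sqlite3` as a stand-in for PostgreSQL so that it runs without a server. The column names follow the table queried in this notebook, while the column types, the `insert_rows` helper, and the sample rows are assumptions.

```python
import sqlite3

# Column names mirror the notebook's table; the types are assumptions.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS telegram_messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    channel_title TEXT,
    channel_username TEXT,
    message_id INTEGER,
    message TEXT,
    message_date TEXT,
    media_path TEXT,
    emoji_used TEXT,
    youtube_links TEXT
)
"""

# Placeholders keep values out of the SQL text (sqlite3 uses '?').
INSERT_SQL = """
INSERT INTO telegram_messages
    (channel_title, channel_username, message_id, message,
     message_date, media_path, emoji_used, youtube_links)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
"""


def insert_rows(conn, rows):
    """Insert all rows in one transaction; roll back everything on failure."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(INSERT_SQL, rows)


conn = sqlite3.connect(":memory:")
conn.execute(CREATE_SQL)

# Made-up sample rows, for illustration only.
rows = [
    ("Example Channel", "@example", 2, "second message",
     "2023-01-02 10:00:00+00:00", "No Media", "No emoji", "No YouTube link"),
    ("Example Channel", "@example", 1, "first message",
     "2023-01-01 10:00:00+00:00", "No Media", "No emoji", "No YouTube link"),
]
insert_rows(conn, rows)

count = conn.execute("SELECT COUNT(*) FROM telegram_messages").fetchone()[0]
print(count)
```

Against PostgreSQL, the same pattern would use the driver's own placeholder style and a `SERIAL` or identity column for `id`. Batching the rows in one transaction avoids leaving a partially loaded table if an insert fails midway.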

In [16]:
# Check if the data was inserted successfully
query = "SELECT * FROM telegram_messages LIMIT 5;"
df_pg = pd.read_sql(query, engine)

df_pg

Unnamed: 0,id,channel_title,channel_username,message_id,message,message_date,media_path,emoji_used,youtube_links
0,1,Doctors Ethiopia,@DoctorsET,864,"በቀን አንዴ ብቻ የሚባለው የቢዝነስ አማካሪ በ 10,000 ብር ብቻ የተ...",2023-12-18 17:04:02,No Media,👈👈👇👇,https://youtu.be/5DBoEm-8kmA?si=LDLuEecNfULJVD...
1,2,Doctors Ethiopia,@DoctorsET,863,ዶክተርስ ኢትዮጵያ በ አዲስ አቀራረብ በ ቴሌቪዥን ፕሮግራሙን ለመጀመር ከ...,2023-11-03 16:14:39,No Media,👇,https://youtu.be/gwVN5eJQpko?si=xARsSxIEdZtE91GY
2,3,Doctors Ethiopia,@DoctorsET,862,ሞት በስኳር ለልጆቻችን የምናሲዘው ምሳቃ ሳናቀው እድሚያቸውን ይቀንሰው ...,2023-10-02 16:37:39,No Media,No emoji,https://youtu.be/oHiSRrNF7I0?si=Absgm414YSt_kjNq
3,4,Doctors Ethiopia,@DoctorsET,861,ከ HIV የተፈወሰ ሰው አጋጥሟችሁ ያቃል ? ፈውስ እና ህክምና ? ሙሉ ቪ...,2023-09-16 07:54:32,No Media,👇👇👇👇,https://youtu.be/tTeErZxIh_Q?si=jKHyfWcC3sfXbC8L
4,5,Doctors Ethiopia,@DoctorsET,860,በቅርብ ጊዜ በሃገራችን ላይ እየተስተዋለ ያለ የተመሳሳይ ፆታ ( Homos...,2023-09-01 16:16:15,No Media,No emoji,https://youtu.be/0k65P5ouw7s?si=qaUgo75bUa3AMQxD
