DAY 4: Data Integration in Database

## 0. Check and Install All Necessary Libraries

In [66]:
!pip install pandas numpy sqlalchemy psycopg2-binary




## 1. Library Imports

In this section, I import all necessary libraries for data loading, cleaning, and database integration.  
- `pandas` is used for data manipulation and cleaning.
- `numpy` helps with numeric and missing value operations.
- `sqlalchemy` is used for connecting and writing data to the PostgreSQL database.
- `psycopg2` is the PostgreSQL driver that SQLAlchemy uses for the connection.

In [67]:
import pandas as pd
import numpy as np
import psycopg2  # PostgreSQL driver used by SQLAlchemy; imported to confirm it is installed
from sqlalchemy import create_engine

## 2. Connect to Database

In this section, I set up the connection to the PostgreSQL database using SQLAlchemy and psycopg2.  
The code connects to the remote database with the given credentials and creates an engine for running SQL queries from Python.

In [68]:
db_user = "neondb_owner"
db_pass = "npg_CeS9fJg2azZD"
db_host = "ep-falling-glitter-a5m0j5gk-pooler.us-east-2.aws.neon.tech"
db_port = "5432"
db_name = "neondb"

# SQLAlchemy Connection String
engine = create_engine(
    f"postgresql+psycopg2://{db_user}:{db_pass}@{db_host}:{db_port}/{db_name}?sslmode=require"
)
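
Hard-coded credentials in a notebook are easy to leak when the file is shared. A minimal sketch of one alternative, assuming the values are supplied through environment variables. The variable names (`DB_USER`, `DB_PASS`, `DB_HOST`, `DB_PORT`, `DB_NAME`) and the helper `build_postgres_url` are hypothetical and not part of the original notebook:

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=os.environ):
    """Build a SQLAlchemy-style PostgreSQL URL from environment-like settings.

    The variable names (DB_USER, DB_PASS, ...) are assumptions for this sketch.
    """
    user = env["DB_USER"]
    # Percent-encode the password so characters such as "@" or "/" stay valid in a URL
    password = quote_plus(env["DB_PASS"])
    host = env["DB_HOST"]
    port = env.get("DB_PORT", "5432")
    name = env["DB_NAME"]
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{name}?sslmode=require"


# Example with placeholder values (not real credentials)
example_env = {"DB_USER": "demo", "DB_PASS": "p@ss/word", "DB_HOST": "localhost", "DB_NAME": "demo_db"}
url = build_postgres_url(example_env)
print(url)
```

The resulting string can be passed to `create_engine` in the same way as the f-string above.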

In [69]:
import os
print(os.getcwd())


/Users/jennypetschke/Documents/webeet_internship/Repositories/_onboarding_data/daily_tasks/day_4


## 3. Load and Inspect Raw SAT Dataset

Here I load the raw SAT results CSV file and take a first look at the structure and content of the dataset.

In [70]:
df = pd.read_csv("day_4_datasets/sat-results.csv")
df.head()


Unnamed: 0,DBN,SCHOOL NAME,Num of SAT Test Takers,SAT Critical Reading Avg. Score,SAT Math Avg. Score,SAT Writing Avg. Score,SAT Critical Readng Avg. Score,internal_school_id,contact_extension,pct_students_tested,academic_tier_rating
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,355,404,363,355,218160,x345,78%,2.0
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91,383,423,366,383,268547,x234,,3.0
2,01M450,EAST SIDE COMMUNITY SCHOOL,70,377,402,370,377,236446,x123,,3.0
3,01M458,FORSYTH SATELLITE ACADEMY,7,414,401,359,414,427826,x123,92%,4.0
4,01M509,MARTA VALLE HIGH SCHOOL,44,390,433,384,390,672714,x123,92%,2.0


## 4. Data Cleaning: Rename Columns and Inspect for Issues

First, I standardize all column names for consistency and easier data cleaning.  
After that, I check for duplicates and missing values, focusing especially on key columns such as `dbn` and SAT scores.


## I. Column Overview

Here, I review the data types and columns in the SAT dataset to plan the next cleaning steps.

In [71]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 493 entries, 0 to 492
Data columns (total 11 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   DBN                              493 non-null    object 
 1   SCHOOL NAME                      493 non-null    object 
 2   Num of SAT Test Takers           493 non-null    object 
 3   SAT Critical Reading Avg. Score  493 non-null    object 
 4   SAT Math Avg. Score              493 non-null    object 
 5   SAT Writing Avg. Score           493 non-null    object 
 6   SAT Critical Readng Avg. Score   493 non-null    object 
 7   internal_school_id               493 non-null    int64  
 8   contact_extension                388 non-null    object 
 9   pct_students_tested              376 non-null    object 
 10  academic_tier_rating             402 non-null    float64
dtypes: float64(1), int64(1), object(9)
memory usage: 42.5+ KB


## II. Clean and Standardize Column Names

I convert all column names to lowercase snake_case. This makes them easier to access and exposes duplicate or misspelled columns such as `sat_critical_readng_avg_score`.


In [72]:
df.columns = [col.strip().lower().replace(" ", "_").replace(".", "").replace("-", "_") for col in df.columns]
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 493 entries, 0 to 492
Data columns (total 11 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   dbn                             493 non-null    object 
 1   school_name                     493 non-null    object 
 2   num_of_sat_test_takers          493 non-null    object 
 3   sat_critical_reading_avg_score  493 non-null    object 
 4   sat_math_avg_score              493 non-null    object 
 5   sat_writing_avg_score           493 non-null    object 
 6   sat_critical_readng_avg_score   493 non-null    object 
 7   internal_school_id              493 non-null    int64  
 8   contact_extension               388 non-null    object 
 9   pct_students_tested             376 non-null    object 
 10  academic_tier_rating            402 non-null    float64
dtypes: float64(1), int64(1), object(9)
memory usage: 42.5+ KB
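
The chained `replace` calls above handle spaces, dots, and hyphens one at a time. As a hedged alternative, a small helper based on a regular expression can collapse any run of non-alphanumeric characters into a single underscore. The function name `standardize_column` is hypothetical, and this is a sketch rather than the method used in this notebook:

```python
import re


def standardize_column(name):
    """Lowercase a column name and collapse runs of non-alphanumeric characters into one underscore."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip().lower())
    # Remove underscores left over at the start or end
    return cleaned.strip("_")


raw_columns = ["DBN", "SCHOOL NAME", "SAT Math Avg. Score", "pct_students_tested"]
standardized = [standardize_column(c) for c in raw_columns]
print(standardized)
```

For the column names in this dataset, the helper should produce the same names as the list comprehension above.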


## III. Inspect and Handle Duplicates and Missing Values

### Checking for Missing Values

Now, I inspect the dataset for missing values in each column.  
This helps determine which fields need to be cleaned or imputed before loading into the database.


In [73]:
# Show only columns with at least one missing value
missing = df.isnull().sum()
cols_with_missing = missing[missing > 0]
print(cols_with_missing)

contact_extension       105
pct_students_tested     117
academic_tier_rating     91
dtype: int64
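
Absolute counts are easier to judge next to the share of rows they represent. A minimal sketch on a toy frame (the values are made up for illustration and stand in for the SAT data) that reports both the count and the percentage of missing values per column:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the SAT data (values are made up for illustration)
toy = pd.DataFrame({
    "dbn": ["01M292", "01M448", "01M450", "01M458"],
    "contact_extension": ["x345", np.nan, "x123", np.nan],
    "academic_tier_rating": [2.0, 3.0, np.nan, 4.0],
})

missing_count = toy.isnull().sum()
# The mean of a boolean mask is the fraction of True values
missing_pct = (toy.isnull().mean() * 100).round(1)
summary = pd.DataFrame({"missing": missing_count, "missing_pct": missing_pct})
# Keep only columns with at least one missing value
summary = summary[summary["missing"] > 0]
print(summary)
```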


### Checking for Duplicate School Records

The `dbn` column should uniquely identify each school.  
I check for any duplicate DBN values in the dataset and display them if found.

In [74]:
# Count duplicates in the DBN column
duplicate_count = df['dbn'].duplicated().sum()
print(f"Number of duplicated DBNs: {duplicate_count}")

# Find all duplicate DBNs (both the first and subsequent occurrences)
dups = df[df['dbn'].duplicated(keep=False)].sort_values('dbn')
display(dups)


Number of duplicated DBNs: 15


Unnamed: 0,dbn,school_name,num_of_sat_test_takers,sat_critical_reading_avg_score,sat_math_avg_score,sat_writing_avg_score,sat_critical_readng_avg_score,internal_school_id,contact_extension,pct_students_tested,academic_tier_rating
35,02M419,LANDMARK HIGH SCHOOL,62,390,399,381,390,166135,x123,78%,2.0
486,02M419,LANDMARK HIGH SCHOOL,62,390,399,381,390,166135,x123,78%,2.0
52,02M520,MURRY BERGTRAUM HIGH SCHOOL FOR BUSINESS CAREERS,264,407,440,393,407,892839,,92%,2.0
484,02M520,MURRY BERGTRAUM HIGH SCHOOL FOR BUSINESS CAREERS,264,407,440,393,407,892839,,92%,2.0
491,02M520,MURRY BERGTRAUM HIGH SCHOOL FOR BUSINESS CAREERS,264,407,440,393,407,892839,,92%,2.0
99,05M304,MOTT HALL HIGH SCHOOL,54,413,399,398,413,296405,x123,78%,2.0
490,05M304,MOTT HALL HIGH SCHOOL,54,413,399,398,413,296405,x123,78%,2.0
487,05M304,MOTT HALL HIGH SCHOOL,54,413,399,398,413,296405,x123,78%,2.0
481,07X221,SOUTH BRONX PREPARATORY: A COLLEGE BOARD SCHOOL,65,364,378,348,364,277389,x345,92%,
492,07X221,SOUTH BRONX PREPARATORY: A COLLEGE BOARD SCHOOL,65,364,378,348,364,277389,x345,92%,


In [75]:
df = df.drop_duplicates(subset='dbn', keep='first')
df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 478 entries, 0 to 477
Data columns (total 11 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   dbn                             478 non-null    object 
 1   school_name                     478 non-null    object 
 2   num_of_sat_test_takers          478 non-null    object 
 3   sat_critical_reading_avg_score  478 non-null    object 
 4   sat_math_avg_score              478 non-null    object 
 5   sat_writing_avg_score           478 non-null    object 
 6   sat_critical_readng_avg_score   478 non-null    object 
 7   internal_school_id              478 non-null    int64  
 8   contact_extension               378 non-null    object 
 9   pct_students_tested             363 non-null    object 
 10  academic_tier_rating            392 non-null    float64
dtypes: float64(1), int64(1), object(9)
memory usage: 44.8+ KB


After checking for duplicates in the `dbn` column, I removed the 15 duplicate rows and kept the first occurrence of each school.  
The dataset now contains 478 unique schools.
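
Dropping duplicates by `dbn` with `keep='first'` is only safe when the repeated rows carry the same data. A minimal sketch, on a toy frame with made-up values, of how one might confirm this before dropping; the variable names are hypothetical:

```python
import pandas as pd

# Toy frame: 02M419 is an exact duplicate, 05M304 has conflicting scores
toy = pd.DataFrame({
    "dbn": ["02M419", "02M419", "05M304", "05M304", "01M292"],
    "sat_math_avg_score": ["399", "399", "399", "410", "404"],
})

# Rows whose DBN appears more than once
dup_rows = toy[toy["dbn"].duplicated(keep=False)]

# After removing fully identical rows, a DBN that still appears more than once
# has duplicates that disagree in at least one column
distinct_versions = dup_rows.drop_duplicates().groupby("dbn").size()
conflicting = distinct_versions[distinct_versions > 1].index.tolist()
print(conflicting)
```

An empty list would indicate that every duplicate is an exact copy, so keeping the first occurrence loses no information.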

### Handling Missing Values

I remove any rows with missing values in essential columns (`dbn`, `school_name`, and the SAT score columns), since these are required for analysis and database integration. At this point the essential columns contain no NaN values, so this step drops no rows.  
Missing values in optional columns are left as NaN and will be stored as NULL in the database.

In [76]:
df = df.dropna(subset=[
    'dbn',
    'school_name',
    'num_of_sat_test_takers',
    'sat_critical_reading_avg_score',
    'sat_math_avg_score',
    'sat_writing_avg_score'
])
missing_new = df.isnull().sum()
cols_with_missing_new = missing_new[missing_new > 0]
print(cols_with_missing_new)



contact_extension       100
pct_students_tested     115
academic_tier_rating     86
dtype: int64
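
Because the SAT columns are stored as strings (`object` dtype), `dropna` only removes true NaN values, and a non-numeric placeholder would pass through unnoticed. A minimal sketch of surfacing such entries on a toy frame. The placeholder `"s"` is an assumption used for illustration, not a value confirmed by the outputs above:

```python
import pandas as pd

# Toy frame; the "s" placeholder is hypothetical
toy = pd.DataFrame({
    "dbn": ["01M292", "01M448", "01M450"],
    "sat_math_avg_score": ["404", "s", "402"],
})

# Try the conversion without overwriting the column
as_number = pd.to_numeric(toy["sat_math_avg_score"], errors="coerce")

# Entries that are present but cannot be parsed as numbers
mask = as_number.isna() & toy["sat_math_avg_score"].notna()
non_numeric = toy.loc[mask, "sat_math_avg_score"]
print(non_numeric.value_counts())

# Keep only rows whose score could be parsed
toy_clean = toy[as_number.notna()]
```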


### Removing Duplicate or Misspelled Columns

The dataset contains both `sat_critical_reading_avg_score` and a misspelled copy, `sat_critical_readng_avg_score`.  
I compare the two columns and drop the misspelled one, since it is redundant.

In [77]:
# Check values in both columns
print(df[['sat_critical_reading_avg_score', 'sat_critical_readng_avg_score']].isnull().sum())
print(df[['sat_critical_reading_avg_score', 'sat_critical_readng_avg_score']].head(10))

# The misspelled column duplicates the correctly spelled one, so drop it
df = df.drop(columns=['sat_critical_readng_avg_score'])


sat_critical_reading_avg_score    0
sat_critical_readng_avg_score     0
dtype: int64
  sat_critical_reading_avg_score sat_critical_readng_avg_score
0                            355                           355
1                            383                           383
2                            377                           377
3                            414                           414
4                            390                           390
5                            332                           332
6                            522                           522
7                            417                           417
8                            624                           624
9                            395                           395
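
Comparing the first ten rows by eye does not cover the whole column. A minimal sketch, on a toy frame with values taken from the first rows above, that drops the misspelled column only if it matches the original in every row:

```python
import pandas as pd

toy = pd.DataFrame({
    "sat_critical_reading_avg_score": ["355", "383", "377"],
    "sat_critical_readng_avg_score": ["355", "383", "377"],
})

original = "sat_critical_reading_avg_score"
misspelled = "sat_critical_readng_avg_score"

# Series.equals compares all values and treats NaN in the same position as equal
columns_identical = toy[original].equals(toy[misspelled])
if columns_identical:
    toy = toy.drop(columns=[misspelled])
print(columns_identical, list(toy.columns))
```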


### Converting Columns to Numeric Types

I convert all relevant columns to numeric types for accurate analysis and database integration.

In [78]:
num_cols = [
    'num_of_sat_test_takers',
    'sat_critical_reading_avg_score',
    'sat_math_avg_score',
    'sat_writing_avg_score',
    'academic_tier_rating',
    'pct_students_tested'
]

for col in num_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')
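
Values in `pct_students_tested` look like `78%`, and `pd.to_numeric` with `errors='coerce'` turns any string it cannot parse into NaN. A minimal sketch on a toy column showing the effect, and one way to strip the percent sign before converting:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"pct_students_tested": ["78%", np.nan, "92%"]})

# Direct conversion: "78%" is not a valid number, so every value becomes NaN
direct = pd.to_numeric(toy["pct_students_tested"], errors="coerce")

# Strip the trailing percent sign first, then convert; missing values stay NaN
stripped = toy["pct_students_tested"].str.rstrip("%")
converted = pd.to_numeric(stripped, errors="coerce")
print(converted.tolist())
```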


### Final Data Quality Check

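I review the final DataFrame to check the column types and the remaining missing values. The output shows two issues: 57 of the 478 rows now have NaN in the SAT columns because their non-numeric entries were coerced, and `pct_students_tested` is entirely NaN because values such as `78%` cannot be parsed as numbers.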

In [79]:
df.info()
df.head()
print(df.isnull().sum())


<class 'pandas.core.frame.DataFrame'>
Index: 478 entries, 0 to 477
Data columns (total 10 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   dbn                             478 non-null    object 
 1   school_name                     478 non-null    object 
 2   num_of_sat_test_takers          421 non-null    float64
 3   sat_critical_reading_avg_score  421 non-null    float64
 4   sat_math_avg_score              421 non-null    float64
 5   sat_writing_avg_score           421 non-null    float64
 6   internal_school_id              478 non-null    int64  
 7   contact_extension               378 non-null    object 
 8   pct_students_tested             0 non-null      float64
 9   academic_tier_rating            392 non-null    float64
dtypes: float64(6), int64(1), object(3)
memory usage: 41.1+ KB
dbn                                 0
school_name                         0
num_of_sat_test_takers       

## 5. Exporting Cleaned Data

I export the cleaned SAT results as a CSV file for integration into the PostgreSQL database.


In [80]:
import os
import pandas as pd

cleaned_path = "day_4_task/sat_results_cleaned.csv"
raw_path = "day_4_datasets/sat-results.csv"

if not os.path.exists(cleaned_path):
    # Load and clean the raw data
    df = pd.read_csv(raw_path)
    # [Insert your cleaning steps here!]
    # Example: standardize the column names
    df.columns = [col.strip().lower().replace(" ", "_").replace(".", "").replace("-", "_") for col in df.columns]
    # (Remaining cleaning steps...)

    # Save the cleaned file
    df.to_csv(cleaned_path, index=False)
    print(f"Cleaned CSV saved as {cleaned_path}")
else:
    print(f"Cleaned CSV already exists at {cleaned_path}. Please delete it first if you want to re-create it.")


Cleaned CSV saved as day_4_task/sat_results_cleaned.csv
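
The cleaned data can also be written to a database table directly from pandas with `DataFrame.to_sql`. A minimal sketch, assuming a hypothetical table name `sat_results` and using an in-memory SQLite connection as a stand-in for the PostgreSQL engine so that the example runs without network access:

```python
import sqlite3

import pandas as pd

# Two rows taken from the head of the dataset, reduced to three columns
cleaned = pd.DataFrame({
    "dbn": ["01M292", "01M448"],
    "school_name": [
        "HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES",
        "UNIVERSITY NEIGHBORHOOD HIGH SCHOOL",
    ],
    "sat_math_avg_score": [404.0, 423.0],
})

# In-memory SQLite stands in for the PostgreSQL engine in this sketch
conn = sqlite3.connect(":memory:")

# if_exists="replace" drops and recreates the table on each run
cleaned.to_sql("sat_results", conn, if_exists="replace", index=False)

# Read the row count back to confirm the load
row_count = pd.read_sql("SELECT COUNT(*) AS n FROM sat_results", conn)["n"].iloc[0]
print(row_count)
conn.close()
```

With the SQLAlchemy `engine` from section 2, the same `to_sql` call would take `engine` in place of `conn`.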
