# Data Cleaning and Preparation

### Inspecting the data and preparing it for insertion into a database

I initially thought I'd have to scrape the site, but it turns out that the server's response containing the data is viewable in the browser's developer tools. I filtered by English-language Bachelor's courses and copied the `graphql` response into [english_bachelor_courses.json](./english_bachelor_courses.json).

In [53]:
import pandas as pd
import json

In [54]:
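# Load the GraphQL response saved from the browser's developer tools.
# The file contains Finnish characters, so read it explicitly as UTF-8.
with open('english_bachelor_courses.json', encoding='utf-8') as f:
    data = json.load(f)

# Flatten nested objects (timing, enrollment) into dotted column names
courses_df = pd.json_normalize(data['data']['searchRealizations'])
print(f'Number of courses: {len(courses_df)}')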

Number of courses: 452


Renaming columns to conform to SQL naming conventions:

In [55]:
courses_df.rename(columns={'__typename': 'type_name', 'timing.start': 'timing_start', 'timing.end': 'timing_end',
                            'enrollment.start': 'enrollment_start', 'enrollment.end': 'enrollment_end'}, inplace=True)
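
A possible shortcut for the renaming step: `pd.json_normalize` accepts a `sep` argument, so the nested keys can be joined with underscores at load time. The sketch below uses a made-up one-record sample shaped like the response (the values and the `flat_df` name are assumptions, not part of the notebook), and only `__typename` still needs a manual rename.

```python
import pandas as pd

# Hypothetical sample record shaped like one entry of searchRealizations.
sample = [
    {
        "id": "A-1",
        "__typename": "Realization",
        "timing": {"start": "2024-01-08", "end": "2024-05-20"},
        "enrollment": {"start": "2023-11-20", "end": "2024-01-04"},
    }
]

# sep="_" joins nested keys with an underscore instead of the default dot,
# giving timing_start, timing_end, enrollment_start, enrollment_end directly.
flat_df = pd.json_normalize(sample, sep="_")

# The top-level GraphQL field is not nested, so it is renamed by hand.
flat_df = flat_df.rename(columns={"__typename": "type_name"})

print(sorted(flat_df.columns))
```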

Checking whether `id` and `code` are always the same. If they are, we can drop `code`:

In [56]:
rows_where_unequal = len(courses_df[courses_df['id'] != courses_df['code']])
print(f'Number of rows where id and code are not equal: {rows_where_unequal}')

Number of rows where id and code are not equal: 0


In [57]:
courses_df.drop(columns=['code'], inplace=True)
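
The check and the drop can also be tied together, so that `code` is only removed while the two columns really do match. This is a minimal sketch on a made-up frame (`sample_df` and its values are assumptions standing in for `courses_df`), not the notebook's own code.

```python
import pandas as pd

# Hypothetical stand-in for courses_df.
sample_df = pd.DataFrame(
    {
        "id": ["A-1", "B-2"],
        "code": ["A-1", "B-2"],
        "title": ["Course one", "Course two"],
    }
)

# True only when every row has the same value in both columns.
codes_match = sample_df["id"].eq(sample_df["code"]).all()

# Guard the drop so a future export with diverging values fails loudly
# instead of silently losing information.
if codes_match:
    sample_df = sample_df.drop(columns=["code"])
else:
    raise ValueError("id and code differ in at least one row; keep both columns")

print(list(sample_df.columns))
```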

In [58]:
courses_df.iloc[0]

id                                                      TLLY3500-3009
title                                           Logistics Simulations
credits                                                             5
degreeProgrammes    [{'id': '5284', 'title': 'Bachelor's Degree Pr...
studentGroups       [{'id': '143487', 'code': 'ZJATLS23SMM', 'titl...
type_name                                                 Realization
timing_start                                               2024-01-08
timing_end                                                 2024-05-20
enrollment_start                                           2023-11-20
enrollment_end                                             2024-01-04
Name: 0, dtype: object

In [59]:
courses_df.iloc[0]['degreeProgrammes']

[{'id': '5284',
  'title': "Bachelor's Degree Programme in Logistics",
  '__typename': 'DegreeProgramme'}]

In [60]:
courses_df.iloc[0]['studentGroups']

[{'id': '143487',
  'code': 'ZJATLS23SMM',
  'title': 'Avoin amk, Logistiikka, Monimuoto',
  '__typename': 'StudentGroup'},
 {'id': '131534',
  'code': 'TLS23SMM',
  'title': 'Logistiikka - tutkinto-ohjelma (AMK)',
  '__typename': 'StudentGroup'}]

In [61]:
# Build one row per (course, degree programme) and per (course, student group) pair
degreeProgrammes_courses_data = []
studentGroups_courses_data = []

for _, row in courses_df.iterrows():
    # Each course carries a list of degree programme dicts
    for degreeProgramme in row['degreeProgrammes']:
        degreeProgrammes_courses_data.append({'id': degreeProgramme['id'], 'name': degreeProgramme['title'], 'course_id': row['id']})

    # ...and a list of student group dicts
    for studentGroup in row['studentGroups']:
        studentGroups_courses_data.append({'id': studentGroup['id'], 'code': studentGroup['code'], 'name': studentGroup['title'],
                                            'type_name': studentGroup['__typename'], 'course_id': row['id']})

degree_programmes_df = pd.DataFrame(degreeProgrammes_courses_data)
student_groups_df = pd.DataFrame(studentGroups_courses_data)
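
The same flattening can probably be done without an explicit loop by pointing `pd.json_normalize` at the nested lists with `record_path` and carrying the course `id` along via `meta`. The sketch below runs on made-up records (all values and the `records`, `programmes_df`, `groups_df` names are assumptions); in the notebook the input would be the raw list under `data['data']['searchRealizations']`. `meta_prefix` is needed because both the course and the nested items have an `id` key.

```python
import pandas as pd

# Hypothetical records shaped like entries of searchRealizations.
records = [
    {
        "id": "C-1",
        "title": "Course one",
        "degreeProgrammes": [
            {"id": "10", "title": "Programme A", "__typename": "DegreeProgramme"},
        ],
        "studentGroups": [
            {"id": "100", "code": "G1", "title": "Group 1", "__typename": "StudentGroup"},
            {"id": "101", "code": "G2", "title": "Group 2", "__typename": "StudentGroup"},
        ],
    },
    {
        "id": "C-2",
        "title": "Course two",
        "degreeProgrammes": [
            {"id": "10", "title": "Programme A", "__typename": "DegreeProgramme"},
            {"id": "11", "title": "Programme B", "__typename": "DegreeProgramme"},
        ],
        "studentGroups": [],
    },
]

# Column names used by the notebook's per-entity frames.
renames = {"title": "name", "__typename": "type_name"}

# One row per nested item; the parent course id is repeated as course_id.
programmes_df = pd.json_normalize(
    records, record_path="degreeProgrammes", meta=["id"], meta_prefix="course_"
).rename(columns=renames)

groups_df = pd.json_normalize(
    records, record_path="studentGroups", meta=["id"], meta_prefix="course_"
).rename(columns=renames)

print(programmes_df)
print(groups_df)
```

Unlike the loop above, this version also keeps `type_name` for degree programmes; it can be dropped afterwards if it turns out to be constant.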

In [62]:
degree_programmes_df.head()

| | id | name | course_id |
|---|---|---|---|
| 0 | 5284 | Bachelor's Degree Programme in Logistics | TLLY3500-3009 |
| 1 | 22411 | Bachelor's Degree Programme in Purchasing and ... | ZZPP0520-3210 |
| 2 | 82940 | Bachelor's Degree Programme in Information and... | TTC2070-3016 |
| 3 | 5290 | Bachelor's Degree Programme in Information and... | TTC2070-3016 |
| 4 | 5265 | Bachelor's Degree Programme in Nursing | SWNSW205-3003 |


In [63]:
student_groups_df.head()

| | id | code | name | type_name | course_id |
|---|---|---|---|---|---|
| 0 | 143487 | ZJATLS23SMM | Avoin amk, Logistiikka, Monimuoto | StudentGroup | TLLY3500-3009 |
| 1 | 131534 | TLS23SMM | Logistiikka - tutkinto-ohjelma (AMK) | StudentGroup | TLLY3500-3009 |
| 2 | 143480 | ZJATLP23S1 | Avoin amk, Purchasing and Logistics Engineerin... | StudentGroup | ZZPP0520-3210 |
| 3 | 131506 | TLP23S1 | Bachelor's Degree Programme in Purchasing and ... | StudentGroup | ZZPP0520-3210 |
| 4 | 119696 | TTV22S5 | Tieto- ja viestintätekniikka (AMK) | StudentGroup | TTC2070-3016 |


In [64]:
student_groups_df['type_name'].unique()

array(['StudentGroup'], dtype=object)

Dropping the `type_name` column from `student_groups_df`, since its only value is `StudentGroup`. Also dropping `degreeProgrammes` and `studentGroups` from `courses_df`, as their contents now live in their own dataframes.

In [65]:
student_groups_df.drop(columns=['type_name'], inplace=True)
courses_df.drop(columns=['degreeProgrammes', 'studentGroups'], inplace=True)
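
Rather than checking `unique()` one column at a time, constant columns could be found in a single pass with `nunique`. A rough sketch on a made-up frame (`sample_df`, `constant_cols` and `trimmed_df` are hypothetical names); it assumes every column holds hashable values, so list-valued columns such as `degreeProgrammes` would have to be dropped first.

```python
import pandas as pd

# Hypothetical stand-in for student_groups_df.
sample_df = pd.DataFrame(
    {
        "id": ["100", "101", "102"],
        "code": ["G1", "G2", "G3"],
        "type_name": ["StudentGroup", "StudentGroup", "StudentGroup"],
    }
)

# A column is constant when it has at most one distinct value.
# dropna=False counts NaN as a value, so a mix of NaN and one real
# value is not treated as constant.
constant_cols = [
    col for col in sample_df.columns if sample_df[col].nunique(dropna=False) <= 1
]

trimmed_df = sample_df.drop(columns=constant_cols)

print(constant_cols)
print(list(trimmed_df.columns))
```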

In [66]:
courses_df.head()

| | id | title | credits | type_name | timing_start | timing_end | enrollment_start | enrollment_end |
|---|---|---|---|---|---|---|---|---|
| 0 | TLLY3500-3009 | Logistics Simulations | 5 | Realization | 2024-01-08 | 2024-05-20 | 2023-11-20 | 2024-01-04 |
| 1 | ZZPP0520-3210 | Development as an Expert | 5 | Realization | 2023-08-21 | 2026-07-31 | 2023-08-01 | 2023-08-24 |
| 2 | TTC2070-3016 | Project Management and Practices | 4 | Realization | 2023-08-28 | 2023-12-11 | 2023-08-01 | 2023-08-24 |
| 3 | SWNSW205-3003 | Practice in Perioperative Nursing | 6 | Realization | 2023-08-08 | 2024-05-19 | 2023-08-01 | 2023-08-24 |
| 4 | TLTT6500-3011 | Logistics Information Technology | 5 | Realization | 2024-01-08 | 2024-05-20 | 2023-11-20 | 2024-01-04 |
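
With the frame flattened like this, the insertion step itself could look roughly as follows. This is only a sketch under assumptions: it uses an in-memory SQLite database, a made-up two-row frame, and a hypothetical table name `courses`; the notebook does not say which database or schema is actually used.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the cleaned courses_df.
courses = pd.DataFrame(
    {
        "id": ["A-1", "B-2"],
        "title": ["Course one", "Course two"],
        "credits": [5, 4],
        "timing_start": ["2024-01-08", "2023-08-28"],
    }
)

# In-memory database, used here only so the example is self-contained.
conn = sqlite3.connect(":memory:")

# index=False keeps the pandas row index out of the table;
# if_exists="replace" recreates the table on each run.
courses.to_sql("courses", conn, index=False, if_exists="replace")

# Read the row count back to confirm the insert.
row_count = conn.execute("SELECT COUNT(*) FROM courses").fetchone()[0]
print(row_count)

conn.close()
```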
