## Data Cleanup

Now that we have loaded the data from the SwitchUp website, we need to clean it and prepare it for
insertion into the database.

Activities:
* Remove HTML-like markup from fields such as description
* Check for NULL values and decide what to do with them

In [3]:
#imports
import pandas as pd
import re

In [2]:
# Load CSV files we downloaded, in order to clean them
badges = pd.read_csv('badges.csv')
comments = pd.read_csv('comments.csv')
schools = pd.read_csv('schools.csv')
locations = pd.read_csv('locations.csv')
courses = pd.read_csv('courses.csv')
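
CSV files that were saved together with their DataFrame index come back with extra columns named `Unnamed: 0`, `Unnamed: 0.1`, and so on, as some of the outputs below show. A minimal sketch for dropping them right after loading follows; `drop_index_columns` is a hypothetical helper, not part of the original notebook, and the inline CSV is a small stand-in for a file such as badges.csv.

```python
import io

import pandas as pd


def drop_index_columns(df):
    """Hypothetical helper: drop leftover index columns such as 'Unnamed: 0'."""
    keep = ~df.columns.str.startswith('Unnamed')
    return df.loc[:, keep]


# Stand-in for a CSV file that was written twice with its index included
raw = io.StringIO(
    "Unnamed: 0.1,Unnamed: 0,name,school,school_id\n"
    "0,0,Available Online,ironhack,10828\n"
    "1,1,Verified Outcomes,ironhack,10828\n"
)

badges_demo = drop_index_columns(pd.read_csv(raw))
print(list(badges_demo.columns))
```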

#### Remove HTML

In [4]:
def strip_html(text):
    # Remove anything that looks like an HTML tag, e.g. <p> or </span>
    return re.sub(r'<[^<]+?>', '', text)
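
A regex-based stripper like the one above raises a `TypeError` on missing values (pandas represents them as float `NaN`) and leaves HTML entities such as `&amp;` in place. The sketch below is one possible variant under those assumptions; `strip_html_safe` and `TAG_RE` are hypothetical names, and the sample string is made up to resemble the `body` field. A regex is only a rough approximation of HTML parsing, so plain text containing both `<` and `>` could be removed by mistake.

```python
import html
import re

# Matches anything that looks like a tag: '<', one or more non-'>' characters, '>'
TAG_RE = re.compile(r'<[^>]+>')


def strip_html_safe(value):
    """Hypothetical variant of strip_html that tolerates missing values."""
    if not isinstance(value, str):
        return value  # leave NaN / None untouched
    text = TAG_RE.sub('', value)
    # Turn entities such as &amp; back into plain characters
    return html.unescape(text).strip()


sample = '<span class="truncatable"><p></p><p>Pros: 1)Intensive Learning &amp; more</p></span>'
cleaned = strip_html_safe(sample)
missing = strip_html_safe(float('nan'))
print(cleaned)
```

Because the function takes a single value, it could be applied column-wise, for example `comments['body'].apply(strip_html_safe)`.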

In [13]:
schools['description'] = schools['description'].apply(strip_html)
schools.head()

Unnamed: 0.1,Unnamed: 0,website,description,LogoUrl,school,school_id
0,0,www.ironhack.com/en,Ironhack is a global tech school with 9 campus...,https://d92mrp7hetgfk.cloudfront.net/images/si...,ironhack,10828
1,0,appacademy.io,"Founded in 2012, App Academy is a world-renown...",https://d92mrp7hetgfk.cloudfront.net/images/si...,app-academy,10525
2,0,www.springboard.com/?utm_source=switchup&utm_m...,Springboard is an online learning platform tha...,https://d92mrp7hetgfk.cloudfront.net/images/si...,springboard,11035
3,0,anyonecanlearntocode.com/?utm_source=switchup&...,Actualize is a coding bootcamp that values qua...,https://d92mrp7hetgfk.cloudfront.net/images/si...,actualize,10505
4,0,learningfuze.com,"LearningFuze is an immersive, 14-week web deve...",https://d92mrp7hetgfk.cloudfront.net/images/si...,learningfuze,10862


In [15]:
badges['description'] = badges['description'].apply(strip_html)
badges.head()

Unnamed: 0.1,Unnamed: 0,name,keyword,description,school,school_id
0,0,Available Online,available_online,School offers fully online courses,ironhack,10828
1,1,Verified Outcomes,verified_outcomes,School publishes a third-party verified outcom...,ironhack,10828
2,2,Flexible Classes,flexible_classes,School offers part-time and evening classes,ironhack,10828
3,0,Available Online,available_online,School offers fully online courses,app-academy,10525
4,1,Flexible Classes,flexible_classes,School offers part-time and evening classes,app-academy,10525


In [18]:
comments.head()

Unnamed: 0.1,Unnamed: 0,id,name,anonymous,hostProgramName,graduatingYear,isAlumni,jobTitle,tagline,body,...,queryDate,program,user,overallScore,comments,overall,curriculum,jobSupport,review_body,school
0,0,306215,Anonymous,True,,2023.0,True,,Transformative Experience: My Time at Ironhack,"<span class=""truncatable""><p></p><p>Pros: 1)In...",...,2023-11-06,Web Development Bootcamp,{'image': None},4.0,[],4.0,4.0,4.0,Pros: 1)Intensive Learning 2)Real-World Projec...,ironhack
1,1,306068,Anonymous,True,,2023.0,False,Full stack development,Now I can do it,"<span class=""truncatable""><p></p><p>7 months a...",...,2023-10-31,,{'image': None},5.0,[],5.0,5.0,5.0,"7 months ago, I only had an idea about html an...",ironhack
2,2,305297,Utku Cikmaz,False,,2023.0,False,Full Stack Web Developer,It was good,"<span class=""truncatable""><p></p><p>The course...",...,2023-10-02,Web Development Bootcamp,{'image': None},4.0,[],5.0,3.0,4.0,"The course was great. Especially, Luis is a gr...",ironhack
3,3,305278,Nirmal Hodge,False,,2023.0,False,Product Designer,Ironhack 100% Worth It!,"<span class=""truncatable""><p></p><p>I joined t...",...,2023-09-30,UX/UI Design Bootcamp,{'image': None},5.0,[],5.0,5.0,5.0,I joined the UX/ UI Bootcamp and to be honest ...,ironhack
4,4,305231,Anonymous,True,,2023.0,False,,Still waiting a refund,"<span class=""truncatable""><p></p><p>Unfortunat...",...,2023-09-28,Web Development Bootcamp,{'image': None},1.0,[],1.0,1.0,1.0,Unfortunately wouldn’t recommend it. Still wai...,ironhack


In [23]:
comments['body'] = comments['body'].apply(strip_html)
comments.head()

Unnamed: 0.1,Unnamed: 0,id,name,anonymous,hostProgramName,graduatingYear,isAlumni,jobTitle,tagline,body,...,queryDate,program,user,overallScore,comments,overall,curriculum,jobSupport,review_body,school
0,0,306215,Anonymous,True,,2023.0,True,,Transformative Experience: My Time at Ironhack,Pros: 1)Intensive Learning 2)Real-World Projec...,...,2023-11-06,Web Development Bootcamp,{'image': None},4.0,[],4.0,4.0,4.0,Pros: 1)Intensive Learning 2)Real-World Projec...,ironhack
1,1,306068,Anonymous,True,,2023.0,False,Full stack development,Now I can do it,"7 months ago, I only had an idea about html an...",...,2023-10-31,,{'image': None},5.0,[],5.0,5.0,5.0,"7 months ago, I only had an idea about html an...",ironhack
2,2,305297,Utku Cikmaz,False,,2023.0,False,Full Stack Web Developer,It was good,"The course was great. Especially, Luis is a gr...",...,2023-10-02,Web Development Bootcamp,{'image': None},4.0,[],5.0,3.0,4.0,"The course was great. Especially, Luis is a gr...",ironhack
3,3,305278,Nirmal Hodge,False,,2023.0,False,Product Designer,Ironhack 100% Worth It!,I joined the UX/ UI Bootcamp and to be honest ...,...,2023-09-30,UX/UI Design Bootcamp,{'image': None},5.0,[],5.0,5.0,5.0,I joined the UX/ UI Bootcamp and to be honest ...,ironhack
4,4,305231,Anonymous,True,,2023.0,False,,Still waiting a refund,Unfortunately wouldn’t recommend it. Still wai...,...,2023-09-28,Web Development Bootcamp,{'image': None},1.0,[],1.0,1.0,1.0,Unfortunately wouldn’t recommend it. Still wai...,ironhack


#### Decide about empty values

In [24]:
badges.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68 entries, 0 to 67
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   68 non-null     int64 
 1   name         68 non-null     object
 2   keyword      68 non-null     object
 3   description  68 non-null     object
 4   school       68 non-null     object
 5   school_id    68 non-null     int64 
dtypes: int64(2), object(4)
memory usage: 3.3+ KB


In [32]:
comments.head()

Unnamed: 0.1,Unnamed: 0,id,name,anonymous,hostProgramName,graduatingYear,isAlumni,jobTitle,tagline,body,...,queryDate,program,user,overallScore,comments,overall,curriculum,jobSupport,review_body,school
0,0,306215,Anonymous,True,,2023.0,True,,Transformative Experience: My Time at Ironhack,Pros: 1)Intensive Learning 2)Real-World Projec...,...,2023-11-06,Web Development Bootcamp,{'image': None},4.0,[],4.0,4.0,4.0,Pros: 1)Intensive Learning 2)Real-World Projec...,ironhack
1,1,306068,Anonymous,True,,2023.0,False,Full stack development,Now I can do it,"7 months ago, I only had an idea about html an...",...,2023-10-31,,{'image': None},5.0,[],5.0,5.0,5.0,"7 months ago, I only had an idea about html an...",ironhack
2,2,305297,Utku Cikmaz,False,,2023.0,False,Full Stack Web Developer,It was good,"The course was great. Especially, Luis is a gr...",...,2023-10-02,Web Development Bootcamp,{'image': None},4.0,[],5.0,3.0,4.0,"The course was great. Especially, Luis is a gr...",ironhack
3,3,305278,Nirmal Hodge,False,,2023.0,False,Product Designer,Ironhack 100% Worth It!,I joined the UX/ UI Bootcamp and to be honest ...,...,2023-09-30,UX/UI Design Bootcamp,{'image': None},5.0,[],5.0,5.0,5.0,I joined the UX/ UI Bootcamp and to be honest ...,ironhack
4,4,305231,Anonymous,True,,2023.0,False,,Still waiting a refund,Unfortunately wouldn’t recommend it. Still wai...,...,2023-09-28,Web Development Bootcamp,{'image': None},1.0,[],1.0,1.0,1.0,Unfortunately wouldn’t recommend it. Still wai...,ironhack


In [28]:
schools.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   24 non-null     int64 
 1   website      24 non-null     object
 2   description  24 non-null     object
 3   LogoUrl      24 non-null     object
 4   school       24 non-null     object
 5   school_id    24 non-null     int64 
dtypes: int64(2), object(4)
memory usage: 1.2+ KB


In [35]:
locations.head()

Unnamed: 0.1,Unnamed: 0,id,description,country.id,country.name,country.abbrev,city.id,city.name,city.keyword,state.id,state.name,state.abbrev,state.keyword,school,school_id
0,0,15901,"Berlin, Germany",57.0,Germany,DE,31156.0,Berlin,berlin,,,,,ironhack,10828
1,1,16022,"Mexico City, Mexico",29.0,Mexico,MX,31175.0,Mexico City,mexico-city,,,,,ironhack,10828
2,2,16086,"Amsterdam, Netherlands",59.0,Netherlands,NL,31168.0,Amsterdam,amsterdam,,,,,ironhack,10828
3,3,16088,"Sao Paulo, Brazil",42.0,Brazil,BR,31121.0,Sao Paulo,sao-paulo,,,,,ironhack,10828
4,4,16109,"Paris, France",38.0,France,FR,31136.0,Paris,paris,,,,,ironhack,10828


In [30]:
courses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216 entries, 0 to 215
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  216 non-null    int64 
 1   courses     216 non-null    object
 2   school      216 non-null    object
 3   school_id   216 non-null    int64 
dtypes: int64(2), object(2)
memory usage: 6.9+ KB


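The `info()` output shows no null values in badges, schools, or courses. The sample rows of
locations and comments do contain empty fields (for example the state.* columns, hostProgramName,
and jobTitle), so we can decide later whether to drop those rows or fill them with a default.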
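
To make that decision with numbers rather than by eye, the null count per column can be computed directly. The following is a minimal sketch on a small stand-in DataFrame (`locations_demo`, a hypothetical name whose values mirror the sample rows above); filling with an empty string is only one possible default, not a choice made in the original notebook.

```python
import pandas as pd

# Small stand-in for the locations table
locations_demo = pd.DataFrame({
    'city.name': ['Berlin', 'Paris'],
    'state.name': [None, None],
    'school': ['ironhack', 'ironhack'],
})

# Number of missing values per column
null_counts = locations_demo.isna().sum()
print(null_counts[null_counts > 0])

# One option: default a text column to an empty string instead of dropping rows
filled = locations_demo.fillna({'state.name': ''})
```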

#### Store cleaned up data

In [31]:
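# index=False keeps the row index out of the files, so no extra 'Unnamed: 0' column appears on reload
schools.to_csv('schools.csv', index=False)
badges.to_csv('badges.csv', index=False)
courses.to_csv('courses.csv', index=False)
locations.to_csv('locations.csv', index=False)
comments.to_csv('comments.csv', index=False)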
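
Each save that includes the row index adds another `Unnamed` column the next time the file is loaded. The sketch below shows a save and reload cycle with `index=False` that preserves the column list; it writes to an in-memory buffer instead of a file, and `schools_demo` is a hypothetical two-row stand-in for the real table.

```python
import io

import pandas as pd

schools_demo = pd.DataFrame({
    'school': ['ironhack', 'app-academy'],
    'school_id': [10828, 10525],
})

# Write without the index, then read the same text back
buffer = io.StringIO()
schools_demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns))
```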

Schema definition used on dbdesigner.net:


badges {
    name string
    keyword string
    description string
    school string > schools.school
    school_id number
}

schools {
    website string
    description string
    LogoUrl string
    school string
    school_id number
}


courses {
    courses string
    school string > schools.school
    school_id number
}

locations {
    id number
    description string
    country.id number
    country.name string
    country.abbrev string
    city.id number
    city.name string
    city.keyword string
    state.id number
    state.name string
    state.abbrev string
    state.keyword string
    school string > schools.school
    school_id number
}

comments {
    id number
    name string
    anonymous string
    hostProgramName string
    graduatingYear number
    isAlumni string
    jobTitle string
    tagline string
    body string
    rawBody string
    createdAt string
    queryDate string
    program string
    user string
    overallScore number
    comments string
    overall number
    curriculum number
    jobSupport number
    review_body string
    school string > schools.school
}
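
The `> schools.school` references in the schema only hold if `school` is unique in schools and every child row points at an existing school. A minimal sketch of that check follows; `find_orphans` is a hypothetical helper, and the `unknown-school` row is made-up data added to show what a violation looks like.

```python
import pandas as pd

schools_demo = pd.DataFrame({
    'school': ['ironhack', 'app-academy'],
    'school_id': [10828, 10525],
})
badges_demo = pd.DataFrame({
    'name': ['Available Online', 'Flexible Classes', 'Available Online'],
    'school': ['ironhack', 'app-academy', 'unknown-school'],  # last row is invented
})


def find_orphans(child, parent, key='school'):
    """Hypothetical helper: rows in child whose key has no match in parent."""
    return child[~child[key].isin(parent[key])]


# The referenced column should be unique to act as a key
key_is_unique = schools_demo['school'].is_unique

orphans = find_orphans(badges_demo, schools_demo)
print(orphans)
```

Running the same check for courses, locations, and comments before inserting would surface rows that a foreign key constraint would reject.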