## ICX Media Data Science Challenge

#### Vishal Hundal

The dataset used for this challenge is the [wiki4HE Data Set](https://archive.ics.uci.edu/ml/datasets/wiki4HE) from the UCI Machine Learning Repository. It is a survey of faculty members from two Spanish universities on teaching uses of Wikipedia.

#### This repository contains:
- `originalDataset.csv`: The downloaded dataset, a survey of faculty members from two Spanish universities on how they perceive and use Wikipedia in teaching and academia.


- `cleanDataset.csv`: Cleaned and munged version of the original dataset.


- `ICX Media Data Science Challenge.ipynb`: The completed challenge that does exploratory data analysis, clustering and classification on the dataset.

#### Dataset Description:
This dataset contains 53 attributes and 913 data points. The first seven attributes are:
1. **Age**: Integer


2. **Gender**: Male=0; Female=1


3. **Domain**: Arts & Humanities=1; Sciences=2; Health Sciences=3; Engineering & Architecture=4; Law & Politics=5


4. **PhD**: No=0; Yes=1


5. **Experience Years (YEARSEXP)**: Integer


6. **University**: Open University of Catalonia (UOC)=1; Pompeu Fabra University (UPF)=2


7. **Position at UOC (UOC_Position)**: Professor=1; Associate=2; Assistant=3; Lecturer=4; Instructor=5; Adjunct=6
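The coded values listed above can be turned into readable labels with a dictionary lookup. The following is a minimal sketch, assuming the column names `GENDER`, `DOMAIN`, `PhD` and `UOC_POSITION` from the CSV header; the dictionary names and the `sample` frame are illustrative, and the three sample rows reuse the values of the first three records in the dataset.

```python
import pandas as pd

# Code-to-label mappings, taken from the attribute list above.
GENDER_LABELS = {0: "Male", 1: "Female"}
DOMAIN_LABELS = {
    1: "Arts & Humanities",
    2: "Sciences",
    3: "Health Sciences",
    4: "Engineering & Architecture",
    5: "Law & Politics",
}
PHD_LABELS = {0: "No", 1: "Yes"}
UOC_POSITION_LABELS = {
    1: "Professor",
    2: "Associate",
    3: "Assistant",
    4: "Lecturer",
    5: "Instructor",
    6: "Adjunct",
}

# Illustrative frame holding four coded columns for three records.
sample = pd.DataFrame(
    {
        "GENDER": [0, 0, 0],
        "DOMAIN": [2, 5, 4],
        "PhD": [1, 1, 1],
        "UOC_POSITION": [2, 2, 3],
    }
)

# Series.map looks up each code; codes missing from a mapping become NaN.
labeled = sample.assign(
    GENDER=sample["GENDER"].map(GENDER_LABELS),
    DOMAIN=sample["DOMAIN"].map(DOMAIN_LABELS),
    PhD=sample["PhD"].map(PHD_LABELS),
    UOC_POSITION=sample["UOC_POSITION"].map(UOC_POSITION_LABELS),
)
print(labeled)
```

Keeping the mappings in plain dictionaries makes it easy to spot codes that the attribute list does not document, since they show up as missing values after the lookup.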

---
#### Starting off & downloading the dataset
Here I import all of the libraries and modules required for the entire challenge, then download the dataset and save it as a CSV file.

In [16]:
import csv
from os import path, remove
import numpy as np
import pandas as pd
import requests

In [2]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00334/wiki4HE.csv"
r = requests.get(url, allow_redirects = True)
open("originalDataset.csv", "wb").write(r.content)

99358
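The download step can also be wrapped in a small function that fails loudly on an HTTP error instead of silently saving an error page. This is a sketch rather than the notebook's own code: `download_file` and its `timeout` default are hypothetical names and values chosen for illustration, while `requests.get`, `raise_for_status` and `Path.write_bytes` are standard calls.

```python
from pathlib import Path

import requests


def download_file(url: str, dest: str, timeout: float = 30.0) -> int:
    """Download `url` to `dest` and return the number of bytes written.

    Hypothetical helper: the name and the timeout default are assumptions.
    """
    response = requests.get(url, allow_redirects=True, timeout=timeout)
    # Raise an exception on 4xx/5xx responses rather than saving the error body.
    response.raise_for_status()
    # write_bytes opens, writes and closes the file, returning the byte count.
    return Path(dest).write_bytes(response.content)
```

Called as `download_file(url, "originalDataset.csv")`, it returns the byte count that the notebook cell prints.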

---
#### Cleaning the data:

In [18]:
pd.read_csv("originalDataset.csv")

| | AGE;GENDER;DOMAIN;PhD;YEARSEXP;UNIVERSITY;UOC_POSITION;OTHER_POSITION;OTHERSTATUS;USERWIKI;PU1;PU2;PU3;PEU1;PEU2;PEU3;ENJ1;ENJ2;Qu1;Qu2;Qu3;Qu4;Qu5;Vis1;Vis2;Vis3;Im1;Im2;Im3;SA1;SA2;SA3;Use1;Use2;Use3;Use4;Use5;Pf1;Pf2;Pf3;JR1;JR2;BI1;BI2;Inc1;Inc2;Inc3;Inc4;Exp1;Exp2;Exp3;Exp4;Exp5 |
| --- | --- |
| 0 | 40;0;2;1;14;1;2;?;?;0;4;4;3;5;5;3;4;4;3;3;2;2;... |
| 1 | 42;0;5;1;18;1;2;?;?;0;2;3;3;4;4;3;3;4;4;4;3;3;... |
| 2 | 37;0;4;1;13;1;3;?;?;0;2;2;2;4;4;3;3;3;2;2;2;5;... |
| 3 | 40;0;4;0;13;1;3;?;?;0;3;3;4;3;3;3;4;3;3;4;3;3;... |
| 4 | 51;0;6;0;8;1;3;?;?;1;4;3;5;5;4;3;4;4;4;5;4;3;4... |
| ... | ... |
| 908 | 43;0;5;1;21;2;?;?;2;0;3;3;3;5;5;2;4;5;3;3;4;5;... |
| 909 | 53;0;6;0;25;2;?;?;6;0;3;3;4;5;4;3;4;4;4;4;4;3;... |
| 910 | 39;0;5;1;9;2;?;?;4;0;3;3;3;5;4;3;3;4;3;3;2;5;2... |
| 911 | 40;0;3;1;10;2;?;?;2;0;3;3;5;5;4;2;4;4;4;4;3;2;... |


As you can see, the file from the UCI repository is hard to read as downloaded. Its fields are separated by semicolons rather than commas, so pandas parses each row as a single column, and the data is hard to use without cleaning.
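The effect of the delimiter can be reproduced on a small in-memory sample. This is a sketch, not part of the original notebook: `raw` holds the first ten columns of the first two records in the file's raw layout, and passing `sep=";"` to `pd.read_csv` is shown as an alternative way to parse the file, not as the cleaning method used in this challenge.

```python
import io

import pandas as pd

# First ten columns of the first two records, in the file's raw layout.
raw = (
    "AGE;GENDER;DOMAIN;PhD;YEARSEXP;UNIVERSITY;UOC_POSITION;"
    "OTHER_POSITION;OTHERSTATUS;USERWIKI\n"
    "40;0;2;1;14;1;2;?;?;0\n"
    "42;0;5;1;18;1;2;?;?;0\n"
)

# Default comma delimiter: every line lands in one wide string column.
as_commas = pd.read_csv(io.StringIO(raw))

# Semicolon delimiter: each field gets its own column.
as_semicolons = pd.read_csv(io.StringIO(raw), sep=";")

print(as_commas.shape)      # (2, 1)
print(as_semicolons.shape)  # (2, 10)
```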

In [24]:
if path.exists("cleanDataset.csv"):
    remove("cleanDataset.csv")

with open("originalDataset.csv", newline="") as ogDataset:
    reader = csv.reader(ogDataset)
    with open("cleanDataset.csv", "w", newline="") as dataset:
        filewriter = csv.writer(dataset, delimiter=",")
        for row in reader:
            # Each row arrives as one semicolon-delimited string; split it into fields.
            filewriter.writerow(row[0].split(";"))

In [25]:
pd.read_csv("cleanDataset.csv")

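| | AGE | GENDER | DOMAIN | PhD | YEARSEXP | UNIVERSITY | UOC_POSITION | OTHER_POSITION | OTHERSTATUS | USERWIKI | ... | BI2 | Inc1 | Inc2 | Inc3 | Inc4 | Exp1 | Exp2 | Exp3 | Exp4 | Exp5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 40 | 0 | 2 | 1 | 14 | 1 | 2 | ? | ? | 0 | ... | 3 | 5 | 5 | 5 | 5 | 4 | 4 | 4 | 1 | 2 |
| 1 | 42 | 0 | 5 | 1 | 18 | 1 | 2 | ? | ? | 0 | ... | 2 | 4 | 4 | 3 | 4 | 2 | 2 | 4 | 2 | 4 |
| 2 | 37 | 0 | 4 | 1 | 13 | 1 | 3 | ? | ? | 0 | ... | 1 | 5 | 3 | 5 | 5 | 2 | 2 | 2 | 1 | 3 |
| 3 | 40 | 0 | 4 | 0 | 13 | 1 | 3 | ? | ? | 0 | ... | 3 | 3 | 4 | 4 | 3 | 4 | 4 | 3 | 3 | 4 |
| 4 | 51 | 0 | 6 | 0 | 8 | 1 | 3 | ? | ? | 1 | ... | 5 | 5 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 908 | 43 | 0 | 5 | 1 | 21 | 2 | ? | ? | 2 | 0 | ... | 2 | 2 | 2 | 2 | 2 | ? | ? | ? | ? | ? |
| 909 | 53 | 0 | 6 | 0 | 25 | 2 | ? | ? | 6 | 0 | ... | 4 | 4 | 3 | 3 | 4 | 4 | 4 | 4 | 1 | 1 |
| 910 | 39 | 0 | 5 | 1 | 9 | 2 | ? | ? | 4 | 0 | ... | 2 | 5 | 4 | 3 | ? | 5 | 5 | 5 | 4 | 1 |
| 911 | 40 | 0 | 3 | 1 | 10 | 2 | ? | ? | 2 | 0 | ... | 5 | 1 | 5 | 2 | 2 | 4 | 4 | 2 | 1 | 1 |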
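The cleaned file still marks missing answers with `?`, which keeps the affected columns from being read as numbers. One possible next step, sketched here as an assumption and not as part of the original cleaning code, is to have pandas treat `?` as a missing value through the `na_values` parameter of `pd.read_csv`. The `clean` string holds the first ten columns of records 0 and 908 from the output above.

```python
import io

import pandas as pd

# First ten columns of records 0 and 908, as they appear in the cleaned file.
clean = (
    "AGE,GENDER,DOMAIN,PhD,YEARSEXP,UNIVERSITY,UOC_POSITION,"
    "OTHER_POSITION,OTHERSTATUS,USERWIKI\n"
    "40,0,2,1,14,1,2,?,?,0\n"
    "43,0,5,1,21,2,?,?,2,0\n"
)

# Treat "?" as NaN so that columns with gaps are parsed as numeric.
df = pd.read_csv(io.StringIO(clean), na_values="?")

# Count missing entries per column and show only the columns that have any.
missing_per_column = df.isna().sum()
print(missing_per_column[missing_per_column > 0])
```

On the full dataset the same call would be `pd.read_csv("cleanDataset.csv", na_values="?")`.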
