# Lab Assignment 4
>This is part 1 of a three-part project in which we develop a dashboard for prospective MSIS students and for recruiters. In this session, we obtain three datasets from the web, clean and reformat them, and produce a prototype visualization for each.

## Dataset #1: LinkedIn Group
The data was collected from __[here](https://www.linkedin.com/school/165847/alumni/?facetFieldOfStudy=101468)__ by manually selecting, copying, and formatting the data into a CSV file. A visualization was then created in Tableau:<br>

<img src="files/images/table4.1.png"width="40%"><br>

*Advantages:* This graph gives prospective students a quick look at the big companies where alumni have landed jobs; the reputations of some of the companies listed (Google, Facebook, Apple, etc.) might give students the impression that the MSIS program will equip them with the skills needed to obtain jobs there. The different circle sizes give a rough sense of how many alumni work at each organization.<br>

*Disadvantages:* This visualization has several disadvantages. First, the data is skewed: LinkedIn does not show the entire list of companies where alumni work, and the data represents only those who are registered on LinkedIn and identify themselves as MSIS alumni; plenty of other alumni are not counted. Second, the counts cover only alumni who are currently employed, and they are small in absolute terms: clicking on 'Apple', a circle of considerable size, reveals that only 11 people work there. Because the pool is so small, the circle gives the wrong impression that a high percentage of alumni work there. Finally, many circles have similar sizes, which makes it difficult to compare companies without any numbers or percentages.<br>

*Possible Improvements:* Setting aside the flaws of the dataset, I can improve the visualization by adding more information to the chart. I can clarify that the data was taken from a pool of around 500 alumni, which is further reduced to roughly one fifth of that size (LinkedIn shows only the top 15 records). I can also choose to show the actual values in the circles to give viewers a better sense of how many alumni work for each organization.
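
The labelling improvement described above could be sketched as follows. This is only an illustration: the helper name `label_with_share` is hypothetical, the pool size of 500 is the approximate figure quoted above, and the only count used is the 11 Apple alumni mentioned earlier.

```python
# Approximate pool size quoted in the text; an assumption, not an exact count.
POOL_SIZE = 500


def label_with_share(company, count, pool_size=POOL_SIZE):
    """Build a circle label showing the raw count and its share of the pool."""
    share = 100 * count / pool_size
    return f"{company}: {count} ({share:.1f}% of ~{pool_size})"


print(label_with_share("Apple", 11))
```

A label of this kind makes it clear that a large circle can still represent a small share of the alumni pool.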








## Dataset #2: Top Skills on LinkedIn
I chose this set of __[data](https://www.tableau.com/about/blog/2017/4/flex-your-data-skills-makeover-monday-68584)__ because LinkedIn has become one of the top places where students look for jobs and where recruiters find potential candidates. For three years (2014, 2015, and 2016), LinkedIn studied member profiles to identify the skills that attract attention from recruiters. The dataset was downloaded in CSV format, filtered to the country of interest (the United States), and reduced to the skills ranked in the top 10.

In [107]:
import pandas as pd

# load data
df = pd.read_csv('LinkedIn Top Skills.csv', delimiter=',')
df.head()

Unnamed: 0,Skill,Country,Year,Rank
0,Algorithm Design,Australia,2015,13
1,Algorithm Design,Australia,2016,9
2,Business Intelligence,Australia,2014,6
3,Business Intelligence,Australia,2015,21
4,C/C++,Australia,2014,17


In [103]:
# filter by country and rank.

df2 = df[(df.Country == "United States") & (df['Rank'] < 11)]
df2.head()


Unnamed: 0,Skill,Country,Year,Rank
739,Algorithm Design,United States,2015,8
740,Algorithm Design,United States,2016,9
744,Cloud and Distributed Computing,United States,2014,1
745,Cloud and Distributed Computing,United States,2015,1
746,Cloud and Distributed Computing,United States,2016,1
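
To check how a skill's rank moves from year to year before plotting, the filtered rows can be pivoted into one row per skill. The sketch below is a possible approach, not part of the original workflow; it rebuilds the five rows shown above by hand so that it runs on its own, and the name `rank_by_year` is hypothetical.

```python
import pandas as pd

# The five rows printed above, rebuilt by hand so the example is self-contained.
sample = pd.DataFrame({
    "Skill": ["Algorithm Design", "Algorithm Design",
              "Cloud and Distributed Computing",
              "Cloud and Distributed Computing",
              "Cloud and Distributed Computing"],
    "Country": ["United States"] * 5,
    "Year": [2015, 2016, 2014, 2015, 2016],
    "Rank": [8, 9, 1, 1, 1],
})

# One row per skill, one column per year. NaN means the skill has no row for
# that year in the filtered data (filtered out, or absent from the source).
rank_by_year = sample.pivot(index="Skill", columns="Year", values="Rank")
print(rank_by_year)
```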


I then used Beautiful Soup to scrape MSIS course information from the SCU website; the course titles were copied and pasted into a CSV file, and extra lines were deleted in Excel. One issue with this step was that the page is relatively old: some new courses, such as Machine Learning, were not listed, so I had to add them manually later. Scraping the more up-to-date course descriptions was difficult because the course titles there are all enclosed in a collapsed table.

In [102]:

import bs4 as bs
import urllib.request

# download the course-description page
sauce = urllib.request.urlopen('https://www.scu.edu/business/graduates/bulletin/programs-and-admissions/course-descriptions/master-of-science-in-information-systems/').read()
# parse the HTML with the lxml parser
soup = bs.BeautifulSoup(sauce,'lxml')

# course titles are wrapped in <strong> tags; print the text of each one
for tag in soup.find_all("strong"):
    print(tag.text)
    

Course Numbering Key:
2XXX – Courses for M.S. students only3XXX – Courses for MBA students
Refer to previous year bulletin for 3-digit course information (with exception of MSIS and AMBA courses which are included here).
MSIS 2601. Object-Oriented Analysis and Programming
MSIS 2602. Information Systems Analysis and Design
MSIS 2603. Database Management Systems
MSIS 2604. Information Systems Policy and Strategy
MSIS 2605. Telecommunications and Business Networks
MSIS 2606. Software Project Management 
MSIS 2621. Business Intelligence and Data Warehousing 
MSIS 2622. ERP Systems
MSIS 2623. Financial Information Systems
MSIS 2624. E-Business Technologies - Virtualization
MSIS 2625. Information Security Management
MSIS 2626. Computer Simulation and Modeling
MSIS 2627. Big Data Modeling and Analytics 
MSIS 2628. The Business of Cloud Computing
MSIS 2629. Dashboards, Scorecards, and Visualization 
MSIS 2630. Web Programming
MSIS 2640. Capstone Project Proposal
MSIS 2641. Information Technolo
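
The printed strings include header text as well as course titles. Instead of copying and cleaning them by hand, the titles could be filtered with a regular expression and written straight to CSV. The sketch below assumes every course title starts with `MSIS`, a four-digit number, and a period, as in the output above; `keep_course_titles` and `COURSE_PATTERN` are hypothetical names, and a short hand-typed list stands in for the scraped strings.

```python
import csv
import io
import re

# Matches strings such as "MSIS 2601. Object-Oriented Analysis and Programming".
COURSE_PATTERN = re.compile(r"^MSIS \d{4}\.\s+\S")


def keep_course_titles(strings):
    """Return the stripped strings that look like MSIS course titles."""
    return [s.strip() for s in strings if COURSE_PATTERN.match(s.strip())]


# A few of the strings printed above, standing in for the full scraped list.
scraped = [
    "Course Numbering Key:",
    "MSIS 2601. Object-Oriented Analysis and Programming",
    "MSIS 2606. Software Project Management ",
]
titles = keep_course_titles(scraped)

# Write one title per row; io.StringIO stands in for a real file here.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Course"])
writer.writerows([title] for title in titles)
print(buffer.getvalue())
```

Courses missing from the page, such as Machine Learning, would still have to be added manually.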

I merged the LinkedIn skill-rank data with the scraped course information by appending the appropriate course(s) next to each skill; this shows interested students which courses teach the ranked skills. The results are plotted in Tableau.
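
One way to perform this kind of merge in pandas is a left join on the skill name. This is a sketch under assumptions rather than the exact procedure used here: the `courses` mapping is built by hand, it lists only one course per skill, and the frame names `ranks`, `courses`, and `merged` are hypothetical.

```python
import pandas as pd

# A few skill ranks, standing in for the filtered LinkedIn data.
ranks = pd.DataFrame({
    "Skill": ["Cloud and Distributed Computing",
              "Statistical Analysis and Data Mining",
              "Algorithm Design"],
    "Year": [2014, 2014, 2015],
    "Rank": [1, 2, 8],
})

# Hand-built skill-to-course mapping (assumed; one course per skill here).
courses = pd.DataFrame({
    "Skill": ["Cloud and Distributed Computing",
              "Statistical Analysis and Data Mining"],
    "Corresponding Course": ["MSIS 2628. The Business of Cloud Computing",
                             "MSIS 2627. Big Data Modeling and Analytics"],
})

# A left join keeps every skill, leaving NaN where no course was mapped.
merged = ranks.merge(courses, on="Skill", how="left")
print(merged)
```

Keeping unmatched skills in the result makes it easy to spot skills that no course title covers explicitly.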



In [105]:
# load data
skills = pd.read_csv('skills_courses.csv', delimiter=',')
skills.head()

Unnamed: 0,Skill,Country,Year,Rank,Corresponding Course
0,Cloud and Distributed Computing,United States,2014,1,"MSIS 2628. The Business of Cloud Computing, MS..."
1,Cloud and Distributed Computing,United States,2015,1,"MSIS 2628. The Business of Cloud Computing, MS..."
2,Cloud and Distributed Computing,United States,2016,1,"MSIS 2628. The Business of Cloud Computing, MS..."
3,Statistical Analysis and Data Mining,United States,2014,2,"MSIS 2627. Big Data Modeling and Analytics, MS..."
4,Statistical Analysis and Data Mining,United States,2015,2,"MSIS 2627. Big Data Modeling and Analytics, MS..."


This is the resulting table:<br>
<img src="files/images/table4.2.png"width="70%"><br>

*Advantages:* This visualization shows a clear trend in the rankings of the different skills, as well as how the MSIS program relates to them. A prospective student who is interested in a particular skill, or who has a specific LinkedIn job listing in mind, can see from the tooltip which MSIS class to consider taking to study the topic further. Recruiters for jobs that require those skills can also see which classes SCU offers and perhaps click the provided links to read the course descriptions; this gives them a better sense of how students acquire the skill, which they can keep in mind when recruiting from the program.<br>

*Disadvantages:* A major disadvantage is that the mark labels do not follow the data: they cannot be rotated to fit the data points, which makes the visualization look cluttered. It is also not apparent, without moving the cursor onto a skill, that specific courses are tailored to it. In addition, although some courses are designed with a listed skill in mind, others do not refer to the skill in the course title: object-oriented programming could also cover Python, capstone design projects would also build algorithm design skills, and so on. Some courses did not match any skill but still provide the foundations for the listed skills; for example, the visualization course could be very helpful in acquiring statistical analysis skills. Finally, the data is limited to the US; for students looking for a job in another country, the ranking would look different and some courses may not carry over (for example, ethics or policy and strategy).<br>

*Improvements:* I could improve this visualization by providing links to the course descriptions for further reading. I could also put a mark label on each skill indicating how many specific courses relate to that skill.



## Dataset #3: Increasing Number of Jobs in Data
I chose this set of __[data](https://www.bls.gov/oes/tables.htm)__ from the Bureau of Labor Statistics for the years 2014, 2015, and 2016. These are Excel tables that can be downloaded directly from the website. The tables are cleaned up here and saved to a new CSV file.

In [129]:
# load data
lr = pd.read_csv('laborstat14.csv', delimiter=',')

# Tag rows with the year, keep only the needed columns, then filter by occupation code.
lr['YEAR'] = "2014"
lr = lr[['YEAR','OCC_CODE','OCC_TITLE','TOT_EMP','A_MEAN']]


lr1 =lr.loc[lr['OCC_CODE'].isin(["15-1120","15-1140"])]




In [128]:
# load data
lb = pd.read_csv('laborstat15.csv', delimiter=',')
lb['YEAR'] = "2015"
# Keep only the needed columns, then filter by occupation code.
lb = lb[['YEAR','OCC_CODE','OCC_TITLE','TOT_EMP','A_MEAN']]


lb1 =lb.loc[lb['OCC_CODE'].isin(["15-1120","15-1140"])]


In [127]:
# load data
le = pd.read_csv('laborstat16.csv', delimiter=',')
le['YEAR'] = "2016"
# Keep only the needed columns, then filter by occupation code.
le = le[['YEAR','OCC_CODE','OCC_TITLE','TOT_EMP','A_MEAN']]


le1 =le.loc[le['OCC_CODE'].isin(["15-1120","15-1140"])]

In [124]:
# Combine all three years into one table and write to file; data plotted in Tableau.

lrbe = pd.concat([lr1, lb1, le1])
lrbe.to_csv('lab4laborStat.csv',index=False)
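
The three cells above repeat the same steps for each year, so they could be folded into one helper. The sketch below is one possible refactoring, not the code that produced the results: `select_occupations`, `KEEP`, and `CODES` are hypothetical names, and a tiny made-up table stands in for the `laborstat` CSV files so the example runs without them.

```python
import pandas as pd

KEEP = ["YEAR", "OCC_CODE", "OCC_TITLE", "TOT_EMP", "A_MEAN"]
CODES = ["15-1120", "15-1140"]


def select_occupations(table, year, codes=CODES):
    """Tag a yearly table with its year and keep only the chosen occupation codes."""
    tagged = table.assign(YEAR=str(year))
    return tagged.loc[tagged["OCC_CODE"].isin(codes), KEEP]


# Made-up rows standing in for one yearly file; in practice each table would
# come from pd.read_csv('laborstat14.csv') and so on.
demo = pd.DataFrame({
    "OCC_CODE": ["15-1120", "11-0000"],
    "OCC_TITLE": ["Title A", "Title B"],
    "TOT_EMP": [100, 200],
    "A_MEAN": [50000, 60000],
})

combined = pd.concat(
    [select_occupations(demo, year) for year in (2014, 2015, 2016)],
    ignore_index=True,
)
print(combined)
```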

This is the resulting table:<br>

<img src="files/images/table4.3.png"width="60%"><br>

*Advantages:* This graph shows a clear upward trend in both total employment and mean annual salary for the two selected occupations. The viewer will quickly understand the argument I am trying to communicate.<br>

*Disadvantages:* Very few data points were selected for this visualization, which makes the argument somewhat weak. The Bureau of Labor Statistics is also dated in its description of data-related jobs: data analysis is so prevalent across industries that it should not be restricted to computers and mathematics; it is also used in education, financial organizations, and so on. To counter this, I selected two very broad fields that include the positions most often associated with data analysts and data scientists.<br>

*Improvements:* I could provide links to the specific title descriptions to give the audience a clear idea of what these jobs are.
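
The upward trend could also be quantified as a year-over-year percentage change per occupation. The sketch below uses made-up employment numbers, not BLS figures, and assumes `TOT_EMP` has already been parsed as a number; the column name `EMP_CHANGE_PCT` is hypothetical.

```python
import pandas as pd

# Made-up numbers for illustration only; these are not BLS figures.
demo = pd.DataFrame({
    "YEAR": ["2014", "2015", "2016", "2014", "2015", "2016"],
    "OCC_CODE": ["15-1120"] * 3 + ["15-1140"] * 3,
    "TOT_EMP": [1000, 1100, 1210, 500, 550, 550],
})

# Sort so that each occupation's rows are in chronological order.
demo = demo.sort_values(["OCC_CODE", "YEAR"])

# Previous year's employment within the same occupation (NaN for the first year).
previous = demo.groupby("OCC_CODE")["TOT_EMP"].shift(1)
demo["EMP_CHANGE_PCT"] = (demo["TOT_EMP"] / previous - 1) * 100
print(demo)
```

The same calculation applied to `A_MEAN` would give the salary trend.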

## Discussion

My original claim for the datasets collectively was that the MSIS program equips students with the skills currently in highest demand on the job market. The third visualization also shows that jobs related to data management and data analysis are on the rise, making the MSIS program an ideal place for students aiming to jump-start their careers. However, this rests on an assumption: students who complete the MSIS program will be well equipped for the job hunt, but whether they land the high-demand jobs (such as those at FLAG companies) depends on their individual abilities. To strengthen my claim, I could find further data showing that alumni obtain these positions quickly (how long it takes them to find a job) and that positions related to the high-demand skills (dataset #2) are filling up fast.


### Resources Used:

1. MSIS Alumni Work Info/Skillset: __[LinkedIn Leavey Alumni Group](https://www.linkedin.com/school/165847/alumni/?facetFieldOfStudy=101468)__<br>
2. Top Skills on LinkedIn: __[Makeover Monday Blog Post](https://www.tableau.com/about/blog/2017/4/flex-your-data-skills-makeover-monday-68584)__<br>
3. Occupational Employment Statistics: __[Bureau of Labor Statistics](https://www.bls.gov/oes/tables.htm)__<br>
4. MSIS Curriculum Info: __[List of Courses](https://www.scu.edu/business/graduates/bulletin/programs-and-admissions/course-descriptions/master-of-science-in-information-systems/)__<br>
5. Beautiful Soup Documentation: __[Link](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag)__<br>
6. Beautiful Soup Tutorial Series: __[Link](https://www.youtube.com/watch?v=aIPqt-OdmS0&list=PLQVvvaa0QuDfV1MIRBOcqClP6VZXsvyZS)__<br>