

<img src="https://i.imgflip.com/14vp9i.jpg" width="400" style="float: right; margin: 50px">

# GROUP EDA

Today we will be splitting into 4 groups and presenting our findings during the last hour of class.

The goals for today's activity include:

- Defining a problem statement as a group
- Exploring datasets and practicing exploratory analysis
- Making good use of validation
- Communicating results succinctly

You might consider preparing a brief slide deck, but it's not necessary.  We created [these great guidelines](https://github.com/ga-students/DSI-SF-1/wiki/Presentation-Guidelines) to help.

## Deliverables

- 10-minute presentation, followed by 5 minutes of questions and answers
- Validation of results

### Optional Feedback

After presentations, please give feedback to each other.

- Peer feedback:  [Use this form](http://goo.gl/forms/ybRIcrwYrVVV4hhR2).  All responses are shared.  Be nice ;)

### Suggestions
- Appoint someone to present / organize
- Look at summary statistics, explore data
- Refine problem statement
- Divide and conquer your workload
- Don't fight - you will be held accountable for presenting _something_.
- You will present


## Group 1:  WOW

<img src="https://snag.gy/tZXe0N.jpg" width="500">


Using [this dataset](https://www.kaggle.com/mylesoneill/warcraft-avatar-history/downloads/warcraft-avatar-history.zip), we would like you to figure out if the following is possible:
 
1. Predict when a user has "churned" or will churn.  We want to know about churn!  If the idea is unfamiliar to you, you will have to research it.  We want that \$\$\$\$, so help us deal with churn!<br><br>

1. Also, we want to know more about zones and player classes.  We have a hunch something good is happening there.<br><br>

1. Is there anything else we can do to optimize gameplay?<br><br>

1. We're thinking about making yet another expansion.  Are there any themes we might consider based on player behavior (*i.e., which features are important to consider about current player preferences to ______*)?

**Projected Challenges**
- Research
- Problem statement expected to be much more defined
- There are a lot of areas to explore and focus may be difficult without a plan
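
Before predicting churn, the group needs a working definition of it. One common starting point is "no activity for N days before the end of the observation window". Here is a minimal sketch of that labeling step, assuming a log with one row per player sighting. The column names `player_id` and `timestamp`, the toy rows, and the 30-day threshold are all assumptions for illustration, not taken from the Kaggle file.

```python
import pandas as pd

# Hypothetical activity log: one row per (player, timestamp) observation.
# Column names and values are made up for illustration.
log = pd.DataFrame({
    "player_id": [1, 1, 1, 2, 2, 3],
    "timestamp": pd.to_datetime([
        "2008-01-01", "2008-02-01", "2008-06-20",
        "2008-01-05", "2008-01-20",
        "2008-06-25",
    ]),
})


def label_churn(log, inactivity_days=30):
    """Flag players whose last observation is more than `inactivity_days`
    before the end of the observation window."""
    window_end = log["timestamp"].max()
    last_seen = log.groupby("player_id")["timestamp"].max()
    days_inactive = (window_end - last_seen).dt.days
    return (days_inactive > inactivity_days).rename("churned")


churned = label_churn(log)
print(churned)
```

The threshold is a modeling decision, so it is worth trying a few values and checking how much the churn rate moves before committing to one.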

## Group 2: GOT

<img src="https://snag.gy/8uoD9y.jpg" width="500">

Using [this dataset](./assets/datasets/character-predictions.csv), we would like to know which factors are the most important predictors of mortality within our population.  From what we know, this dataset was scraped from [this website](http://awoiaf.westeros.org/index.php/Main_Page), which may help clarify what the fields in the dataset mean.

**Projected Challenges**
- Some munging
- Regression and classification type problems possible
- Data is not as cut and dried as it seems
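
One way to combine "which factors matter" with the validation goal is to cross-validate a simple classifier and then inspect its coefficients. This is only a sketch on synthetic data: the feature names (`is_noble`, `num_battles`, `noise`) and the target are invented here, and logistic regression is just one reasonable choice of model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n = 400

# Synthetic stand-in features (hypothetical names, not the real columns).
X = pd.DataFrame({
    "is_noble": rng.randint(0, 2, n),
    "num_battles": rng.poisson(2, n),
    "noise": rng.normal(size=n),
})

# Synthetic target: only num_battles influences the chance of dying.
logit = -1.5 + 0.9 * X["num_battles"]
died = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validated accuracy, compared with always guessing the majority class.
scores = cross_val_score(model, X, died, cv=5, scoring="accuracy")
baseline = max(died.mean(), 1 - died.mean())
print("CV accuracy: %.3f, majority baseline: %.3f" % (scores.mean(), baseline))

# Fit on all rows and rank features by absolute coefficient size.
model.fit(X, died)
coefs = pd.Series(model.coef_[0], index=X.columns)
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coefs)
```

Coefficients are only comparable when features are on similar scales, so on real data consider standardizing first or cross-checking against a tree-based importance measure.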

## Group 3:  Last Words

<img src="https://snag.gy/MElcUY.jpg" width="200">

Using [this dataset](https://docs.google.com/spreadsheet/ccc?key=0ArNsipRBvi69dEUxZHVuRTc4ZlctREdldExsOW5rMUE#gid=0), we would like you to approach this as a study that might inform policy on how law is applied.  Let's assume the first step in this process is a high-level report that illustrates any commonalities and patterns found in this dataset:

- Common patterns found in last words of inmates
- How words relate to other features in the dataset
- Of lesser importance: how hard would it be to classify religious inclinations?
- Sentiment as a feature or frequency (e.g., check out textblob: `pip install textblob`)
  - Can this be stratified by other features?
  - Which words seem to be common in these cases?

**Bonus**
Do any part-of-speech (POS) patterns look interesting?

**Projected Challenges**
- Problem statement may be challenging but you will need to come up with one
- This project is a bit more defined but there are vague problems involved
- There are a lot of vague asks
- Heavily NLP based problem
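
For the "words stratified by other features" question, a plain word count per group is often enough to get started. The sketch below uses only the standard library. The group labels, the toy statements, and the tiny stopword set are all made up for illustration; on the real data you would swap in a column from the spreadsheet and a fuller stopword list.

```python
import re
from collections import Counter, defaultdict

# Toy (group, statement) pairs; labels and text are invented for illustration.
records = [
    ("A", "I love my family and I am sorry"),
    ("A", "Thank you to my family"),
    ("B", "I am sorry for the pain"),
    ("B", "I am at peace"),
]

# Tiny illustrative stopword set (an assumption, not a standard list).
STOPWORDS = {"i", "my", "and", "am", "to", "the", "for", "at", "you"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


# One Counter of word frequencies per group.
counts = defaultdict(Counter)
for group, statement in records:
    counts[group].update(tokenize(statement))

for group in sorted(counts):
    print(group, counts[group].most_common(3))
```

Raw counts favor groups with more text, so comparing relative frequencies within each group is usually a fairer next step.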

## Group 4: Indeed, you need a job

<img src="https://lh5.googleusercontent.com/CJwGV91wfvsEsDqCBjsg2ahGKAoP2_Wwd3JXXBb8yWPlIT676LlBnsIfG2bFUtOxgLb0ba5WJKfzMiVdT9ncQhSmscdcxPZDnWfV6Qkb0LhR_6Natug72_MsYTLDJS34vg" width="500">

You need a job.  There are lots of jobs with the title "Data Scientist".  There are also other jobs that call for the same skill set.  Recently we [scraped together a dataset](assets/datasets/indeed.csv) that is not super clean but will work.

Which jobs aren't labeled like "data science" jobs, but have a lot of the same features found in "data scientist" jobs?

 - Who is hiring the most? (this may be difficult without updating our scraper and getting more data, but we'll say this is optional)
 - Which skills seem more important to big, medium, small, or less well-known companies?
 - Which keywords are common?
 - Which are rare?

**Projected Challenges**
- Labeled data is based on search terms and results may overlap quite a bit
- Some cleaning
- Great opportunity to practice NLP
- DO NGRAM ANALYSIS OR ELSE!
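
Since n-gram analysis is on the list, here is a minimal sketch of counting bigrams with `CountVectorizer`. The three `summaries` strings are invented stand-ins for the scraped `summary` column, and the totals are read through `vocabulary_` so the snippet does not depend on a particular scikit-learn version.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Invented stand-ins for job summaries.
summaries = [
    "Experience with machine learning and Python",
    "Machine learning models in Python and SQL",
    "Build dashboards with SQL and Tableau",
]

# ngram_range=(2, 2) keeps bigrams only; text is lowercased by default.
vectorizer = CountVectorizer(ngram_range=(2, 2))
counts = vectorizer.fit_transform(summaries)

# Sum each bigram's column over all documents.
totals = np.asarray(counts.sum(axis=0)).ravel()
bigram_totals = {
    term: int(totals[idx]) for term, idx in vectorizer.vocabulary_.items()
}

# Most frequent first, ties broken alphabetically.
top = sorted(bigram_totals.items(), key=lambda kv: (-kv[1], kv[0]))
print(top[:5])
```

Changing `ngram_range` to `(1, 3)` would count unigrams through trigrams in one pass, and fitting one vectorizer per search term makes the groups easy to compare.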

PLEASE share your notebooks with the rest of class.  This type of analysis will give you an edge in the job market and lead you to other areas to explore / mine / predict.

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import textblob
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [3]:
indeed = pd.read_csv('../Week7/5.2-group_eda/indeed.csv')

In [6]:
indeed.head()

| Unnamed: 0 | reviews | search_term | title | location | summary |
|---|---|---|---|---|---|
| 0 | 18 reviews | business intelligence | Data Warehouse Development Senior Manager | Mountain View, CA | Proficiency fostering internal and external bu... |
| 1 | 1,486 reviews | business intelligence | Manager, Healthcare Business Intelligence | San Francisco, CA | Experience working with BI tools such as such ... |
| 2 | 14 reviews | business intelligence | Business Intelligence Developer | Mountain View, CA | Work with BI Administrator and Architect to le... |
| 3 | 90 reviews | business intelligence | Business Intelligence Analyst |  | \nOur business is growing and we always lookin... |
| 4 | 11 reviews | business intelligence | Sr Business Intelligence Analyst |  | \nWe have a great opportunity available for a ... |


In [7]:
subset = ['search_term', 'summary']
indeed = indeed[subset]

In [8]:
indeed.search_term.unique()

array(['business intelligence', 'search_term', 'analytics',
       'software engineer', 'Data Engineer', 'Data Analyst',
       'Data Scientist'], dtype=object)
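
Note that `'search_term'` shows up as a value in the output above, which suggests the scraped CSV contains repeated header rows. A minimal sketch of filtering them out follows; the `raw` frame is a small invented stand-in for `indeed`.

```python
import pandas as pd

# Small stand-in for the scraped data, with one repeated header row.
raw = pd.DataFrame({
    "search_term": ["business intelligence", "search_term", "analytics"],
    "summary": ["Experience working with BI tools", "summary", "Build dashboards"],
})

# A repeated header row carries the column name as its own value.
is_header_row = raw["search_term"] == "search_term"
cleaned = raw[~is_header_row].reset_index(drop=True)

print(cleaned["search_term"].unique())
```

It is also worth checking whether the search terms should be lowercased, since the labels above mix `'analytics'` with `'Data Analyst'` style capitalization.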

In [12]:
# Concatenate every summary into one string; dropna() guards against missing values
summary_corpus = "".join(indeed.summary.dropna())

In [18]:
# Python 2 only: decode the byte string, replacing undecodable bytes with U+FFFD
summary_corpus = unicode(summary_corpus, errors='replace')

In [22]:
summary_corpus[:500]

u'Proficiency fostering internal and external business relationships. Minimum 5 years\ufffd\ufffd\ufffd experience managing architecture and ETL development and business critical...Experience working with BI tools such as such as Informatica, SAP Business Objects, Cerner PowerInsight, Oracle BI, Dimensional Insight, Cognos Business...Work with BI Administrator and Architect to leverage the existing Business Intelligence tools and bring value to Business. SAP Business Objects....\nOur business is growing and we al'

In [19]:
# 'en' is the legacy model shortcut; spaCy 3+ expects a full name such as 'en_core_web_sm'
en_nlp = spacy.load('en')

In [21]:
summary_ = en_nlp(summary_corpus)

In [48]:
named_ents = summary_.ents

In [49]:
names = list(named_ents)

In [50]:
name_list = list(names)  # plain copy of the entity spans

In [51]:
name_list

[Informatica,
 SAP Business Objects,
 Cerner PowerInsight,
 Oracle BI,
 Dimensional Insight,
 Cognos Business,
 Business Intelligence,
 MIS,
 Passionate People,
 Python,
 Clarity,
 Rodan + Fields,
 Business Intelligence Business,
 Nearly a thousand,
 Instapage,
 Microsoft Business Intelligence Stack,
 SharePoint,
 Nearly a thousand,
 Instapage,
 Business Intelligence,
 Tableau,
 Business Objects,
 Cognos,
 Business Intelligence,
 4,
 Business Users,
 CEP,
 Operations,
 Business Development,
 Credit Karma,
 Business Intelligence Visualization,
 The Hotwire Business Intelligence,
 2+ years,
 Fitbit���s Business,
 San Francisco,
 CA Longterm Direct W2 Contract,
 USC,
 7+,
 SQL BI Developers,
 2+ years,
 Nearly a thousand,
 Instapage,
 Pleasanton,
 2,
 5,
 Four,
 The Senior Reporting Analyst,
 North America,
 Business Intelligence & Reporting,
 Looker,
 Domo,
 Tableau,
 Business, Finance,
 Business Objects Web Intelligence,
 Business Objects Web Intelligence 4.1,
 Data Science,
 Statistics

In [93]:
named_stuff = pd.DataFrame(name_list)

In [94]:
np.ravel(named_stuff)

array([Informatica, None, None, ..., None, None, None], dtype=object)

In [103]:
names = np.ravel(named_stuff)

In [104]:
ind_names = pd.DataFrame(list(names))
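
The `None` padding in the `np.ravel` output above suggests each entity span was unpacked token by token when it went into the DataFrame. If the aim is simply to see which entities come up most, counting the entity strings may be a simpler route. In the sketch below, `entity_texts` is a hand-copied sample from the entity listing above, standing in for `[ent.text for ent in summary_.ents]`.

```python
from collections import Counter

# Hand-copied sample standing in for [ent.text for ent in summary_.ents].
entity_texts = [
    "Informatica", "Business Intelligence", "Instapage", "Tableau",
    "Business Intelligence", "Instapage", "Tableau",
    "Business Intelligence", "Instapage", "Looker",
]

# Tally how often each entity string appears.
entity_counts = Counter(entity_texts)

for text, count in entity_counts.most_common(3):
    print(text, count)
```

Each spaCy entity also exposes `ent.label_`, so entries such as "2+ years" or "4" can be filtered out by entity type before counting.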

In [113]:
from textblob import TextBlob

testimonial = TextBlob(summary_corpus)

In [114]:
testimonial.sentiment

Sentiment(polarity=0.16428396717715085, subjectivity=0.4528216436867881)