![Workforce](https://media.licdn.com/media/gcrc/dms/image/C4E12AQFh8lR4q1oi0w/article-cover_image-shrink_600_2000/0?e=2123092800&v=beta&t=pZz7F6dquEe6l2NEx1SvvOzKCt_tbC0GHqtOyVm9b8k)
# <center>Silicon Valley</center>

## Why Care

Women today hold only about a quarter of U.S. computing and mathematical jobs—a fraction that has actually fallen slightly over the past 15 years, even as women have made big strides in other fields. Women not only are hired in lower numbers than men are; they also leave tech at more than twice the rate men do. It’s not hard to see why. Studies show that women who work in tech are interrupted in meetings more often than men. They are evaluated on their personality in a way that men are not. They are less likely to get funding from venture capitalists, who, studies also show, find pitches delivered by men—especially handsome men—more persuasive. And in a particularly cruel irony, women’s contributions to open-source software are accepted more often than men’s are, but only if their gender is unknown.

In short, there are social issues in the world of technology, and I intend to explore the gender gap using 2016 workforce data from 23 top Silicon Valley companies. 



## **The Data** 

There are six columns in this dataset:

**company:** Company name

**year:** For now, 2016 only

**race:** Possible values: "American_Indian_Alaskan_Native", "Asian", "Black_or_African_American", "Latino", "Native_Hawaiian_or_Pacific_Islander", "Two_or_more_races", "White", "Overall_totals"

**gender:** Possible values: "male", "female". Non-binary gender is not counted in EEO-1 reports.

**job_category:** Possible values: "Administrative support", "Craft workers", "Executive/Senior officials & Mgrs", "First/Mid officials & Mgrs", "laborers and helpers", "operatives", "Professionals", "Sales workers", "Service workers", "Technicians", "Previous_totals", "Totals"

**count:** Mostly integer values, but contains "na" for a no-data variable.

## Plotly and SNS Visualizations

This will also be a crash course in [Plotly](https://plot.ly/), one of the emerging data visualization libraries. Be prepared to learn about these 3 types of charts: 
* Pie Charts
* Bar Graphs Vertical and Horizontal
* Multivariate Bar Graphs



## Steps to Success
### Step 1 | Create our gender data frame
### Step 2 | Remove impossible values from count column
### Step 3 | Total employee count
### Step 4 | Bar graph with sns plot of total employee count
### Step 5 | Exploring gender data with plotly pie charts :)
### Step 6 | Cluster Bar Chart with Plotly
### Step 7 | Male to Female Ratios Calculation
### Step 8 | Visualize the Male to Female Ratio 




# Load in Libraries

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#import missingno as msno
pd.options.mode.chained_assignment = None

from IPython.display import HTML

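#plotly.plotly moved to the separate chart_studio package in plotly 4 and is not used in this notebook
#import plotly.plotly as py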
import plotly.graph_objs as go
from plotly import tools
from plotly.graph_objs import *
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()

# Step 1 | Create our gender data frame
* Read the csv data into our gender data frame
* Preview Data by looking at the head of the data frame

In [27]:
#gender_data=pd.read_csv('Reveal_EEO1_for_2016.csv')
gender_data=pd.read_csv('../input/Reveal_EEO1_for_2016.csv')
gender_data.head()

# Step 2 | Remove Impossible Values
* Replace all "na" values in the count column with 0 (we can't do math if we have "na" text involved :) ) 
* Convert the count column into a numeric type. 

In [28]:
#assign the result back instead of calling replace with inplace on a single column
gender_data['count']=gender_data['count'].replace(to_replace='na',value=0)
gender_data['count']=gender_data['count'].astype(int)
gender_data.head()
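
An alternative for this step is `pd.to_numeric` with `errors="coerce"`, which also catches any other non-numeric text. The following is a minimal sketch on a made-up column; the values in `raw_counts` are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the real 'count' column (values are made up).
raw_counts = pd.Series(["12", "na", "7", "0"], name="count")

# errors="coerce" turns anything non-numeric (such as "na") into NaN.
as_numbers = pd.to_numeric(raw_counts, errors="coerce")

# Count how many entries were missing before filling them in.
missing = int(as_numbers.isna().sum())

# Fill the gaps with 0 and store as integers, matching the step above.
clean_counts = as_numbers.fillna(0).astype(int)

print(clean_counts.tolist())
print("missing entries:", missing)
```

Counting the missing entries before filling them makes it clear how many zeros in the cleaned column stand for "no data" rather than a true count of zero.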

# Step 3 | Total Employee Count
How many people work for these 23 Silicon Valley companies in the Bay Area? We must group the rows by company and sum the employee counts for each company. 

In [29]:
#aggregate the count data across the different types of employees that work at the 23 Silicon Valley companies
#under exploration
company_count=gender_data.groupby(['company']).agg({'count':'sum'})
company_count.head()
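
The data description lists summary labels such as "Totals" and "Previous_totals" in job_category. If those rows repeat the per-category counts, a plain sum counts employees more than once. The sketch below shows the effect on made-up rows; the toy frame and the assumption that summary rows duplicate the detail rows are mine, so check the real file before relying on it.

```python
import pandas as pd

# Made-up rows that mimic the dataset's layout; not real figures.
toy = pd.DataFrame({
    "company": ["A", "A", "A", "B", "B", "B"],
    "job_category": ["Professionals", "Technicians", "Totals",
                     "Professionals", "Sales workers", "Totals"],
    "count": [10, 5, 15, 20, 4, 24],
})

# Plain sum per company, as in the step above.
naive = toy.groupby("company")["count"].sum()

# Same sum after dropping the summary rows (assumed to be duplicates).
summary_labels = ["Totals", "Previous_totals"]
detail = toy[~toy["job_category"].isin(summary_labels)]
filtered = detail.groupby("company")["count"].sum()

print(naive.to_dict())
print(filtered.to_dict())
```

In this toy case the naive totals are exactly double the filtered ones, which would inflate absolute head counts but leave male to female ratios unchanged.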


# Step 4 | Bar Graph with SNS Bar Plot
We explore the employee count data using seaborn (sns) styling and a matplotlib bar graph. 
* Create the figure and set its size with plt.figure
* Set up the style for the graph. I used 'whitegrid' so the reader can see the grid lines, but feel free to use 'white' instead if you want the maximum amount of whitespace
* Indicate the x axis values, the y axis values, and the color palette you want for your data

Sns Bar Plot [Docs](https://seaborn.pydata.org/generated/seaborn.barplot.html)

# What is a Bar Plot 
A bar plot represents an estimate of central tendency for a numeric variable with the height of each rectangle and provides some indication of the uncertainty around that estimate using error bars. Bar plots include 0 in the quantitative axis range, and they are a good choice when 0 is a meaningful value for the quantitative variable, and you want to make comparisons against it. In our instance we are plotting categorical data (the names of the companies) against the total number of employees.

# Note
It is also important to keep in mind that a bar plot shows only the mean (or other estimator) value, but in many cases it may be more informative to show the distribution of values at each level of the categorical variables. In that case, other approaches such as a box or violin plot may be more appropriate.

# Areas Under Research 
* X-axis Jumbled String Formatting
* Y-axis Number Formatting  
* Palette



In [30]:
#using figure to create a large size for our viewing purposes 
plt.figure(figsize=(10,8))

#using whitegrid to identify grid lines in the bar graphs
#sns.set_style('whitegrid')

#key to creating bar plot line

sns.barplot(x=company_count.index.values,y=company_count['count'])

plt.title('Silicon Valley Companies',size=25)
plt.ylabel('Number of employees',size=14)
plt.xlabel('Companies',size=14)
plt.show()

# X-Axis Ticks Rotate Jumbled Words
In the graph above, the X-axis does not have enough room for the names of all 23 companies, so the labels overlap. To fix this spacing problem we are going to pass text properties to plt.xticks, which control how the tick labels are drawn. 

> plt.xticks()

We are going to update two key text parameters: size and rotation. Currently both are at their defaults. We will set the size to 14 and rotate the label text by 90 degrees.  

> plt.xticks(size=14,rotation=90)

Read [Docs](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.xticks.html) for more info.

In [31]:
#using figure to create a large size for our viewing purposes 
plt.figure(figsize=(10,8))

#key to creating bar plot line
sns.barplot(x=company_count.index.values,y=company_count['count'])

plt.title('Silicon Valley Companies',size=25)
plt.ylabel('Number of employees',size=14)
plt.xlabel('Companies',size=14)

#Updated X-Ticks to make the category labels readable
plt.xticks(size=14,rotation=90)
#sns.despine()
plt.show()

# Note
### Now that the categories are rotated by 90 degrees we can finally see all of the company names. This dramatic change in the visualization only cost us one line of code. Another area of interest is the Y-axis ticks: it is hard to tell at a glance whether a number is 10,000 or 100,000, so the Y-axis labels could benefit from string formatting in a similar fashion to what we did for X.
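
One way to make the Y-axis numbers easier to read is a tick formatter that adds thousands separators. The following is a minimal sketch using matplotlib's `FuncFormatter`; the company names and head counts are made up, and `with_commas` is a hypothetical helper name.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend for scripts; drop this line in a notebook
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

def with_commas(value, position):
    """Format a tick value such as 100000 as '100,000'."""
    return f"{int(value):,}"

# Made-up head counts, for illustration only.
companies = ["A", "B", "C"]
employees = [10000, 55000, 100000]

fig, ax = plt.subplots(figsize=(10, 8))
ax.bar(companies, employees)

# Apply the formatter to every major tick on the Y-axis.
ax.yaxis.set_major_formatter(FuncFormatter(with_commas))
ax.set_ylabel("Number of employees", size=14)

# Render the figure so the formatted tick labels are generated.
fig.canvas.draw()
```

With a seaborn plot the same formatter can be attached to the axes object that `sns.barplot` returns.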


# Step 5 | Exploring Gender Data with Plotly Pie Charts :)
Let's explore the gender data of these Silicon Valley offices with a pie chart to see the differences.

Pie charts are generally used to show percentage or proportional data, and usually the percentage represented by each category is shown next to the corresponding slice of the pie. Pie charts are good for displaying data for around 6 categories or fewer. Luckily for us we only have two categories to explore: the male and female employee counts across the different companies. Below we highlight the difference between the male and female shares.

* Use the gender column to sum the employee counts for each gender 
* Use the labels variable to store the gender names 
* Use the trace variable to combine the label names (male or female), the counts, and the percentages of the genders

Review the [documentation](https://plot.ly/python/pie-charts/) for a detailed view of how to create better pie charts


## Basic Pie Chart

In [8]:
#Data variables 
labels = gender_data.groupby(['gender']).agg({'count':'sum'}).index.values
values = gender_data.groupby(['gender']).agg({'count':'sum'})['count'].values

#add data to pie chart 
trace = go.Pie(labels=labels, values=values)
fig = [trace]
iplot(fig, filename='Pie Chart of Female and Male Employees')

# Note 
Some items are missing from the Pie Chart. We don't have the following: 

* Title Name 
* Categories are hard to follow in the chart

## Style Pie Chart

* Update Colors 
* Update Title 
* Update Chart Category Text Info

### Colors 
We must add a marker parameter to our trace object. It takes the list of colors we want (let's choose blue and green) and the style of the outline drawn around each slice. 
  
    trace = go.Pie(..., marker=dict(colors=colors, line=dict(color='#000000', width=2)))

### Title 
We must also set the title. The filename parameter of iplot does not put a title on the chart, so we add a layout object with a title parameter.

    layout=go.Layout(title='Pie Chart of Female and Male Employee')
    
### Chart Category Text Info
We must add the textinfo parameter to the trace object and concatenate the label and percent options.

    trace = go.Pie(labels=labels, values=values,
               textinfo="label+percent",...)
               

In [9]:
labels = gender_data.groupby(['gender']).agg({'count':'sum'}).index.values
values = gender_data.groupby(['gender']).agg({'count':'sum'})['count'].values
#update for the colors 
colors = ['#a1d99b', '#deebf7']
trace = go.Pie(labels=labels, values=values,
               textinfo="label+percent",
               textfont=dict(size=20),
               opacity=.8,
               marker=dict(colors=colors, 
                           line=dict(color='#ff7f00', width=2)))
layout=go.Layout(title='Pie Chart of Female and Male Employee')
data=[trace]

fig = dict(data=data,layout=layout)
iplot(fig, filename='Pie Chart of Female and Male Employees')

# Note
* Update Colors 
The colors changed from the default orange and blue to light green and light blue. To use colors of your own choice, pass their exact hex codes. You can visit [color hexa](https://www.colorhexa.com/a1d99b) to find color palettes and combinations for an aesthetically pleasing data visualization. 
      
 

## Donut Pie Chart

Let's take our pie chart to the next level with a donut chart, a visual that stands out from a regular pie chart. In this case we have to add a few more parameters to our trace object. 
    
    go.Pie(labels=labels...hole=.3,pull=.03...)

Adding these parameters lets us present the pie chart from a different angle and highlight the major parts of the chart that some readers may not notice until there is clear differentiation. 

### Hole 
Adds a space in the middle of the chart to give it the donut feature. 

### Pull
Adds a space between each category that is a part of the pie chart. This gives the pie chart separation to emphasize the different items under observation. 

In [10]:
#add the hole and pull parameters to the trace to turn the pie into a donut
trace = go.Pie(labels=labels, values=values,
               textinfo="label+percent",
               textfont=dict(size=20),
               hole=.3, pull=.03,
               marker=dict(colors=colors, 
                           line=dict(color='#000000', width=.5)))
layout=go.Layout(title='Pie Chart of Female and Male Employee')
data=[trace]

fig = dict(data=data,layout=layout)
iplot(fig, filename='Pie Chart of Female and Male Employees')

### Discussion

As we can clearly see from the data, there is a huge difference between the number of men and women in the workplace. The delta is 40%. Let's take a closer look to figure out which companies are leading in diversity. 
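
The delta is the difference between the two percentage shares shown on the pie chart. A minimal sketch of that arithmetic follows; the totals are made-up round numbers chosen only to mirror a 40 point gap, not the dataset's real counts.

```python
import pandas as pd

# Made-up totals, not the dataset's real numbers.
totals = pd.Series({"female": 300, "male": 700}, name="count")

# Share of each gender as a percentage of all employees.
share = (totals / totals.sum() * 100).round(1)

# The gap is the difference between the two shares, in percentage points.
gap = share["male"] - share["female"]

print(share.to_dict())
print("gap in percentage points:", gap)
```

Note that a 40 point gap means a 70/30 split, so in this toy case there are more than two men for every woman.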


# Step 6 | Cluster Bar Chart with Plotly 
A grouped bar chart, also known as a clustered bar graph, multi-set bar chart, or grouped column chart, is a type of bar graph that is used to represent and compare different categories of two or more groups. Because the bars are clustered together, we can compare the data series side by side.

### **Let's take a closer look at individual companies by plotting the number of male and female employees at each company to see the distribution of employees**


# Basic Group Bar Chart
* Create trace1 and trace2 variables to store the male and female information 
* x stores the names of the companies
* y stores the count of either the male or the female employees at each company

"Trace" is the common nomenclature that Plotly uses in its API to represent a data series.

In [11]:
d=gender_data.groupby(['gender','company']).agg({'count':'sum'}).reset_index()
trace1 = go.Bar(
    x=d[d.gender=='male']['company'],
    y=d[d.gender=='male']['count'],
)
trace2 = go.Bar(
    x=d[d.gender=='female']['company'],
    y=d[d.gender=='female']['count'],
)
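data = [trace1, trace2]
#define a layout for this chart instead of reusing the pie chart layout from the previous cell
layout = go.Layout(barmode='group')

fig = dict(data=data, layout=layout)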
iplot(fig, filename='Distribution of Male and Female Employees by Company')

# Notes
Plotly gives you a huge amount of design for free. 

### Y Ticks 
If we look at the Y-axis we will notice that Plotly automatically scales and formats the tick labels to improve the readability of the chart.

### X Ticks 
The X-tick labels are rotated automatically, without us having to update them, when there are multiple variables in the data series. 

### Hover 
If we hover over a data series, its name and value pop up. We don't get this feature with the SNS or Matplotlib bar plot. 

### Improvements
Let's give each series a more specific name than the default trace labels, and update the default color scheme to something more our style. 

## Stylized Group Chart 
* Colors 
* Titles

### Colors 
Adding color to the group chart is similar to the pie chart: update the marker parameter of your trace object. 

    marker=dict(
        color='rgb(158,202,225)'
    )

### Name 
Add a name to each data series with the name parameter of the trace object: 

    trace1 = go.Bar(
        x=d[d.gender=='male']['company'],
        y=d[d.gender=='male']['count'],
        name='Males',...)

In [12]:
d=gender_data.groupby(['gender','company']).agg({'count':'sum'}).reset_index()
trace1 = go.Bar(
    x=d[d.gender=='male']['company'],
    y=d[d.gender=='male']['count'],
    name='Males',
    marker=dict(
        color='rgb(158,202,225)'
    )
)
trace2 = go.Bar(
    x=d[d.gender=='female']['company'],
    y=d[d.gender=='female']['count'],
    name='Females',
    marker=dict(
        color='rgb(161,217,155)'
    )
)
data = [trace1, trace2]
layout = go.Layout(
    barmode='group',title='Distribution of Male and Female Employees by Company')


fig = dict(data=data, layout=layout)
iplot(fig, filename='Distribution of Male and Female Employees by Company')

### Notes

We have updated the colors of the chart to our own palette of blue and green. We have also added a title and named the two data series in this cluster bar chart. One thing that is still difficult to read in this graph is the proportion of male to female employees within each company. 

## Discussion
<h3><b> We can see that some of the biggest companies, such as Intel, Apple, Cisco, and Google, have a huge gap between the number of male and female employees. Let us dig a little deeper to see how wide the gap is by exploring the ratio of males to females at each company, so that the comparison is not biased by total employee count.</b></h3>

# Step 7 | Male to Female Ratios Calculation
* Sum the counts for each company and gender from the gender data frame 
* Call the [unstack function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.unstack.html) on the d dataframe to pivot the gender values into columns
* Leverage lambda to convert each company's male and female counts into percentages of its total
* Sort the ratio data from greatest to smallest (descending order)


In [13]:
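#sum the employee counts for every company and gender pair
d=gender_data.groupby(['company','gender']).agg({'count':'sum'})
#pivot the gender level of the index into columns
d=d.unstack()
d=d['count']
#convert each row into percentages of that company's total
d=np.round(d.apply(lambda x: (x/x.sum())*100,axis=1))
#male to female ratio based on the rounded percentages
d['Ratio']=np.round(d['male']/d['female'],2)
d.sort_values(by='Ratio',inplace=True,ascending=False)
d.columns=['Female %','Male %','Ratio']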

In [14]:
d
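
To see what each stage of this calculation produces, here is a minimal sketch on a two-company toy frame. The companies "A" and "B" and their counts are made up, and the percentages are left unrounded to keep the arithmetic easy to follow.

```python
import pandas as pd

# Made-up counts for two hypothetical companies.
toy = pd.DataFrame({
    "company": ["A", "A", "B", "B"],
    "gender": ["female", "male", "female", "male"],
    "count": [25, 75, 40, 60],
})

# One row per company, one column per gender.
wide = toy.groupby(["company", "gender"])["count"].sum().unstack()

# Divide each row by its own total to get percentages.
percent = wide.div(wide.sum(axis=1), axis=0) * 100

# Male to female ratio, sorted from largest to smallest.
percent["Ratio"] = (percent["male"] / percent["female"]).round(2)
percent = percent.sort_values(by="Ratio", ascending=False)

print(percent)
```

Company A has three men for every woman and company B has one and a half, so A sorts first.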

#### Even big companies such as <b>Nvidia, Intel, Cisco, Uber, Google, Apple, Facebook </b> and many more have more than <b>2 male employees for each female employee.</b>
#### <b>Nvidia</b> seems to have the largest male to female ratio out of all 23 Silicon Valley companies, <b>with almost 5 men for each female employee</b>

# Step 8 | Let's Visualize the Male to Female Ratio Data | Horizontal Bar Chart
* We are leveraging plotly to visualize the ratio data with a horizontal bar chart
* Notice that we are using orientation in our trace object to denote that this chart is meant to be horizontal

A 100% stacked column chart is an Excel chart type meant to show the relative percentage of multiple data series in stacked columns, where the total (cumulative) of stacked columns always equals 100%.

More details on horizontal bar graphs with plotly can be found in [Docs!](https://plot.ly/python/bar-charts/)


In [37]:
trace1 = go.Bar(
    y=d.index.values,
    x=d['Ratio'],text=d['Ratio'],textposition='auto',
    orientation='h',
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
)

data = [trace1]
layout = go.Layout(
    barmode='group',title='Ratio of Male to Female Employees')

fig = dict(data=data, layout=layout)
iplot(fig, filename='Ratio of Male to Female Employees')

## Basic Stack Bar Charts
A stacked bar graph (or stacked bar chart) is a chart that uses bars to show comparisons between categories of data, but with the ability to break down and compare parts of a whole. Each bar in the chart represents a whole, and segments in the bar represent different parts or categories of that whole. In our scenario we are stacking the male and female categories to visualize the contrast between the data segments. 

We are going to use the Plotly stacked bar chart framework with a similar setup of the cluster bar charts with the usage of two trace objects, but we are updating our layout object barmode parameter with the string 'stack'. 

    layout = go.Layout(barmode='stack')
    
Updating the layout parameter makes the fig object that we display below a stacked bar chart. 
 
     fig = go.Figure(data=data,layout=layout)
     

In [16]:
trace1 = go.Bar(y=d['Female %'],
               x=d.index.values)

trace2 = go.Bar(y=d['Male %'],
               x=d.index.values)

data = [trace1, trace2]

layout = go.Layout(barmode='stack')
fig = go.Figure(data=data,layout=layout)

iplot(fig, filename="Normalize percentage of male and female data")

## Notes
We can see that 23andMe and Airbnb have the most balanced gender split relative to their total employee counts. But what we don't see are the actual labels and percentages. Let's add some more style to the graph so we can see the different data.

# Style 

## Chart Info
We are updating the chart with the thematic colors that we have used throughout this tutorial (blue and green). Let's also add the percentage to each column.

## Key Difference 2 
We added the percentages of male and female employees as text so that we can clearly see the differences. We also added color inside of our marker object. 

    text=d['Male %'],
    textposition='auto',
    marker=dict(
        color='rgb(161,217,155)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
    

In [36]:

trace1 = go.Bar(
    y=d['Female %'],
    x=d.index.values,
    text=d['Female %'],
    textposition='auto',
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
)
    

trace2= go.Bar(
    y=d['Male %'],
    x=d.index.values,
    text=d['Male %'],
    textposition='auto',
    marker=dict(
        color='rgb(161,217,155)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
)

data = [trace1, trace2]
#stacks the data on top of each other 
layout = go.Layout(
    barmode = 'stack'
)

fig = go.Figure(data=data, layout=layout)
iplot(fig, filename="Normalize Percentage of male to female data")


![](https://media.licdn.com/dms/image/C4E12AQElIcntLKXDNg/article-inline_image-shrink_1500_2232/0?e=2123110800&v=beta&t=sjk8VHF38xSNxdt5wWTwOZZIEI6SjVoHZAP0E0qt_QI)
# <center> Conclusion | Hope </center>

In the past several years, Silicon Valley has begun to grapple with these problems, or at least to quantify them. In 2014, Google released data on the number of women and minorities it employed. Other companies followed, including LinkedIn, Yahoo, Facebook, Twitter, Pinterest, eBay, and Apple. The numbers were not good, and neither was the resulting news coverage, but the companies pledged to spend hundreds of millions of dollars changing their work climates, altering the composition of their leadership, and refining their hiring practices.

#### Project Next Steps

In part two of the tutorial, I will explore the race gap numbers and how to visualize that information as well.