# Top 5 hard skills you should acquire as a Data Analyst

In [None]:
import pandas as pd
import numpy as np
import re

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

**Problems:**

* Which states offer the most jobs for Data Analysts?
* What kind of experience is requested?
* What skills do Data Analysts need?


# Load and clean the data

In [None]:
data = pd.read_csv('../input/data-analyst-jobs/DataAnalyst.csv', index_col=0)
data.head()

In [None]:
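# Clean the salary column: remove ' (Glassdoor est.)', '$' and 'K'.
# regex=True is required on recent pandas versions, where str.replace is literal by default.
data['Salary Estimate'] = data['Salary Estimate'].str.replace(r' \(Glassdoor est\.\)|\$|K', '', regex=True)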


# Split Expected Salary column to MIN and MAX values
salary_range = data['Salary Estimate'].str.split("-", expand=True).rename({0: 'Min Expected Salary', 1: 'Max Expected Salary'}, axis=1)
data = pd.concat([data.drop('Salary Estimate', axis=1), salary_range], axis=1)


# Extract state from `location` column
data['State'] = data['Location'].str[-2:]


# Drop location column
data.drop('Location', axis=1, inplace=True)


# Drop other useless columns
data.drop(['Company Name', 'Headquarters','Type of ownership','Competitors','Easy Apply'], axis=1, inplace=True)


# Convert `Size` to an ordered categorical stored in `Company Size`
data['Company Size'] = pd.Categorical(data['Size'],
               categories=['1 to 50 employees', '51 to 200 employees', '201 to 500 employees', '501 to 1000 employees', '1001 to 5000 employees','5001 to 10000 employees', '10000+ employees'],
               ordered=True)

data.drop('Size', axis=1, inplace=True)


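# Drop rows with missing values (coded as -1) in `Sector` and `Founded`
data = data[~data['Sector'].isin(["-1"])]

# `Founded` is numeric (it is subtracted from 2020 below), so match the integer -1 as well
data = data[~data['Founded'].isin([-1, "-1"])]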

# Calculate age for each company
data['Company age'] = 2020 - data['Founded']


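# Convert variable `Revenue` to categorical
# regex=True is required on recent pandas versions, where str.replace is literal by default.
data['Revenue'] = data['Revenue'].str.replace(r"\$| \(USD\)", "", regex=True)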

data['Company Revenue'] = pd.Categorical(data['Revenue'],
               categories=['Less than 1 million', '1 to 5 million', '5 to 10 million', '10 to 25 million', '25 to 50 million', '50 to 100 million', '100 to 500 million', '500 million to 1 billion', '1 to 2 billion', '2 to 5 billion', '5 to 10 billion', '10+ billion', 'Unknown / Non-Applicable'],
               ordered=True)

data.drop('Revenue', axis=1, inplace=True)
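
The salary clean-up above leaves the minimum and maximum values as strings. The sketch below repeats the same steps on a made-up sample and adds a numeric conversion that the notebook itself does not perform. The sample rows, the `-1` placeholder row, and the use of `str.extract` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample in the same format as the `Salary Estimate` column
sample = pd.DataFrame({
    'Salary Estimate': [
        '$37K-$66K (Glassdoor est.)',
        '$46K-$87K (Glassdoor est.)',
        '-1',  # assumed placeholder for a missing estimate
    ]
})

# Strip the Glassdoor suffix, the dollar signs and the K unit
cleaned = sample['Salary Estimate'].str.replace(
    r' \(Glassdoor est\.\)|\$|K', '', regex=True
)

# Pull out "min-max" as two numeric columns; rows that do not match become NaN
salary = cleaned.str.extract(r'^(\d+)-(\d+)$').astype(float)
salary.columns = ['Min Expected Salary', 'Max Expected Salary']

print(salary)
```

Extracting digits instead of splitting on `-` keeps a placeholder such as `-1` from being split into an empty minimum and a maximum of `1`.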

# Three insights from the data

## 1. There are three hubs in the USA where most job ads come from: CA, TX, NY

In [None]:
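# Count ads per state, largest first (keyword argument is required on recent pandas versions)
data.groupby('State')['Job Title'].count().sort_values(ascending=False).plot.bar(color='k');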

Data Analysts are in exceptionally high demand in California, where most of the biggest startups are located. Texas takes second place with about 250 job ads. According to a CNBC report (https://www.cnbc.com/2019/07/10/these-are-americas-top-states-for-business-in-2019.html) and the Kauffman indicator (picture below), Texas is one of the best places to start a company.

In [None]:
from IPython.display import Image
Image("../input/new-business/kauffman-indicators-chart.png", width=1000)
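
The state counts behind the first chart can be sketched on toy data. The locations below are invented, and the regex-based extraction is an assumption shown as an alternative to taking the last two characters, which would turn a value such as `Remote` into the bogus state `te`.

```python
import pandas as pd

# Hypothetical locations in "City, ST" format plus one value without a state
locations = pd.Series([
    'Los Angeles, CA',
    'Austin, TX',
    'New York, NY',
    'San Francisco, CA',
    'Remote',
])

# Last two characters: simple, but produces 'te' for 'Remote'
naive_state = locations.str[-2:]

# Only accept a two-letter code after a comma at the end of the string
state = locations.str.extract(r', ([A-Z]{2})$', expand=False)

# value_counts ignores missing values by default
state_counts = state.value_counts()
print(state_counts)
```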

## 2. A background in IT or Business Services will benefit you

In [None]:
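# Share of ads per sector (top 10); explicit column names work across pandas versions
sector_share = data['Sector'].value_counts(normalize=True).head(10).rename_axis('Sector').reset_index(name='share')
sns.barplot(x='Sector', y='share', data=sector_share, palette='gray')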
plt.ylabel('Share of total ads')
plt.xlabel("Sectors")
plt.title("Share of total ads by the Sector", fontdict={'size':14, 'weight':'bold'})
plt.xticks(rotation=90);

Information Technology and Business Services add up to 55% of total job offerings, followed by Finance and Health Care. These are the sectors with the biggest demand for Data Analysts.
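
How a combined share like this is computed can be sketched on a toy column. The sector values below are invented for illustration, so the resulting numbers are not those of the real dataset.

```python
import pandas as pd

# Hypothetical sector column with ten ads
sectors = pd.Series(
    ['Information Technology'] * 4
    + ['Business Services'] * 3
    + ['Finance'] * 2
    + ['Health Care'],
    name='Sector',
)

# normalize=True turns counts into shares of all ads, sorted largest first
share = sectors.value_counts(normalize=True)

# Combined share of the two largest sectors
top_two = share.head(2).sum()
print(share)
print(f"Top two sectors: {top_two:.0%}")
```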

## 3. SQL and Excel are the most requested skills for Data Analysts


I looked through some ads and manually stored the skills they mention in a dictionary.

After that, I counted how often these skills appear in the ads.

In [None]:
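# Create a dict of skills as keys and search patterns as values.
# Patterns are matched case-insensitively (re.IGNORECASE is passed when counting).
hard_skills_dict = {
    'Python': r'python',
    'R': r'(?:^|[\s/])r(?=[\s,./]|$)',  # standalone "R"; \b inside [...] would mean backspace
    'Excel': r'\bexcel\b',              # word boundaries avoid matching "excellent"
    'Tableau': r'tableau',
    'SQL': r'sql',
    'SAS': r'\bsas\b',
    'SPSS': r'\bspss\b',
    'VBA': r'\bvba\b',
    'PowerBI': r'power\s?bi',           # "Power BI" or "PowerBI"
    'PowerQuery': r'power\s?query',
    'SAP': r'\bsap\b',
    'AWS': r'\baws\b',
    'Git': r'\bgit',
    'Dashboard': r'\bdashboards?',      # singular or plural
    'Spark': r'\bspark\b',
    'Scala': r'\bscala\b',              # avoids matching "scalable"
    'Matlab': r'matlab',                # was 'Matplotlib', which is a different tool
    'C# or C++': r'\bc(?:#|\+\+)',
    'Java': r'\bjava\b',                # avoids matching "JavaScript"
    'BigQuery': r'big\s?query',
    'Plotly': r'plotly',
    'Looker': r'looker',
    'PowerPivot': r'power\s?pivot',
    'Oracle': r'oracle',
    'UNIX': r'unix',
    'Linux': r'linux'
}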
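
When choosing search patterns, it helps to check them against words that merely contain a skill name. The sketch below uses only the standard `re` module; the sample sentence and the two pattern sets are made up for illustration.

```python
import re

# Hypothetical sentence that mentions none of Excel, Scala or Java
text = "Excellent communicator with scalable JavaScript skills."

# Plain substrings versus the same names wrapped in word boundaries
loose = {'Excel': r'excel', 'Scala': r'scala', 'Java': r'java'}
strict = {'Excel': r'\bexcel\b', 'Scala': r'\bscala\b', 'Java': r'\bjava\b'}

def find_hits(patterns, sentence):
    """Return, per skill, whether its pattern matches anywhere in the sentence."""
    return {
        skill: bool(re.search(pattern, sentence, flags=re.IGNORECASE))
        for skill, pattern in patterns.items()
    }

loose_hits = find_hits(loose, text)
strict_hits = find_hits(strict, text)

print(loose_hits)   # every plain substring matches inside a longer word
print(strict_hits)  # the word-boundary versions do not match
```

Word boundaries are a trade-off: they remove false positives such as "excellent", but `\bexcel\b` still matches the verb "excel", so a manual spot check of matched ads remains worthwhile.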

In [None]:
hard_skills = {}

# Loop through skills, and count the frequency
for key, search in hard_skills_dict.items():
    hard_skills[key] = data['Job Description'].str.contains(search, flags=re.IGNORECASE).sum()

    
# Build a DataFrame of skills, counts and frequencies.
skills = (
    pd.DataFrame.from_dict(hard_skills, orient='index')
    .reset_index()
    .rename({'index': 'skill', 0: 'count'}, axis=1)
    .sort_values('count', ascending=False)  # keyword argument is required on recent pandas versions
)
skills['freq'] = skills['count'] / data.shape[0]

In [None]:
# Plot a barchart of skills
plt.figure(figsize=(20, 6))
sns.barplot(x='skill', y='freq', data=skills, palette='gray')
plt.xticks(rotation=45)
plt.title("Share of job ads mentioning each skill", fontdict={'size':14, 'weight':'bold'})
plt.ylabel("")
plt.xlabel("");

SQL and Excel are the most in-demand hard skills, followed by Python/R and data visualization tools like Tableau.
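
The counting behind this chart can be sketched on three made-up job descriptions. The texts and the reduced pattern set are assumptions for illustration; each skill is counted once per ad, no matter how often the ad repeats it.

```python
import re
import pandas as pd

# Hypothetical job descriptions
ads = pd.Series([
    "Strong SQL and Excel skills required; Python is a plus.",
    "Build Tableau dashboards on top of SQL queries.",
    "Excellent communication skills and attention to detail.",
])

# Reduced, illustrative set of search patterns
patterns = {
    'SQL': r'sql',
    'Excel': r'\bexcel\b',
    'Python': r'python',
    'Tableau': r'tableau',
}

# Number of ads in which each pattern occurs at least once
counts = {
    skill: int(ads.str.contains(pattern, flags=re.IGNORECASE).sum())
    for skill, pattern in patterns.items()
}

# Share of ads mentioning each skill, largest first
freq = pd.Series(counts).div(len(ads)).sort_values(ascending=False)
print(freq)
```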

# Portrait of Data Analyst

The person with the best chance of finding a job as a Data Analyst lives in California, has a background in IT, and has good knowledge of Excel and SQL.