# Activity: Validate and clean your data

## Introduction

In this activity, you will use input validation and label encoding to prepare a dataset for analysis. These are fundamental techniques used across many types of data analysis, from simple linear regression to complex neural networks. 

You are a data professional at an investment firm that is attempting to invest in private companies with a valuation of at least $1 billion. These are often known as "unicorns." Your client wants to develop a better understanding of unicorns, with the hope that they can be early investors in future highly successful companies. They are particularly interested in the investment strategies of the three top unicorn investors: Sequoia Capital, Tiger Global Management, and Accel. 

## Step 1: Imports

Import relevant Python libraries and packages: `numpy`, `pandas`, `seaborn`, and `pyplot` from `matplotlib`.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Load the dataset

The data contains details about unicorn companies, such as when they were founded, when they achieved unicorn status, and their current valuation. The dataset `Modified_Unicorn_Companies.csv` is loaded as `companies`. Display the first five rows. The variables in the dataset have been adjusted to suit the objectives of this lab, so they may differ from similar data used in prior labs. As shown in this cell, the dataset has been loaded for you automatically. You do not need to download the .csv file or provide more code to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [2]:
# Run this cell so pandas displays all columns
pd.set_option('display.max_columns', None)

In [3]:
# RUN THIS CELL TO IMPORT YOUR DATA. 
df = pd.read_csv('Modified_Unicorn_Companies.csv')
companies = df[:]
# Display the first five rows.
df.head()

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Year Founded,Funding,Select Investors
0,Bytedance,180,2017-04-07,Artificial intelligence,Beijing,China,Asia,2012,$8B,"Sequoia Capital China, SIG Asia Investments, S..."
1,SpaceX,100,2012-12-01,Other,Hawthorne,United States,North America,2002,$7B,"Founders Fund, Draper Fisher Jurvetson, Rothen..."
2,SHEIN,100,2018-07-03,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,$2B,"Tiger Global Management, Sequoia Capital China..."
3,Stripe,95,2014-01-23,FinTech,San Francisco,United States,North America,2010,$2B,"Khosla Ventures, LowercaseCapital, capitalG"
4,Klarna,46,2011-12-12,Fintech,Stockholm,Sweden,Europe,2005,$4B,"Institutional Venture Partners, Sequoia Capita..."


In [4]:
df.size

10740

## Step 2: Data cleaning


Begin by displaying the data types of the columns in `companies`.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1074 entries, 0 to 1073
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Company           1074 non-null   object
 1   Valuation         1074 non-null   int64 
 2   Date Joined       1074 non-null   object
 3   Industry          1074 non-null   object
 4   City              1057 non-null   object
 5   Country/Region    1074 non-null   object
 6   Continent         1074 non-null   object
 7   Year Founded      1074 non-null   int64 
 8   Funding           1074 non-null   object
 9   Select Investors  1074 non-null   object
dtypes: int64(2), object(8)
memory usage: 84.0+ KB


In [6]:
df.rename(columns=({'Date Joined':'Date', 'Country/Region':'Country', 'Year Founded':'Founded', 'Select Investors':'Investors'}), inplace=True)
df.Date = pd.to_datetime(df.Date, format='mixed')
#df.Valuation = df.Valuation.str.strip('$B').astype('float')
df = df.drop(df[df.Funding=='Unknown'].index).reset_index(drop=True)
# Caution: stripping the suffix discards the unit, so $8B becomes 8.0 and $349M becomes 349.0 on the same scale
df.Funding = df.Funding.str.strip('$B').str.strip('M').astype('float')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1062 entries, 0 to 1061
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Company    1062 non-null   object        
 1   Valuation  1062 non-null   int64         
 2   Date       1062 non-null   datetime64[ns]
 3   Industry   1062 non-null   object        
 4   City       1045 non-null   object        
 5   Country    1062 non-null   object        
 6   Continent  1062 non-null   object        
 7   Founded    1062 non-null   int64         
 8   Funding    1062 non-null   float64       
 9   Investors  1062 non-null   object        
dtypes: datetime64[ns](1), float64(1), int64(2), object(6)
memory usage: 83.1+ KB
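
The `Funding` conversion above strips the `$`, `B`, and `M` characters, so amounts quoted in billions and in millions end up on one numeric scale (`$8B` becomes `8.0`, `$349M` becomes `349.0`). One possible way to keep the units comparable is sketched below. It assumes every valid entry looks like `$<number>B` or `$<number>M`; the sample values and the names `unit_to_billions` and `funding_billions` are illustrative and not part of the lab.

```python
import pandas as pd

# Hypothetical sample mirroring the `Funding` strings seen in the dataset
funding = pd.Series(['$8B', '$2B', '$349M', 'Unknown'])

# Multiplier that converts each unit suffix to billions of USD
unit_to_billions = {'B': 1.0, 'M': 0.001}

# Split each string into a numeric part and a unit suffix;
# strings that do not match the pattern (such as 'Unknown') yield NaN
parts = funding.str.extract(r'^\$(?P<amount>\d+(?:\.\d+)?)(?P<unit>[BM])$')

# Scale the numeric part by its unit so every value is in billions
funding_billions = pd.to_numeric(parts['amount']) * parts['unit'].map(unit_to_billions)
print(funding_billions)
```

With this approach, unparseable entries stay as missing values, so they can be inspected or dropped explicitly afterwards.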


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Review what you have learned about exploratory data analysis in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

There is a `pandas` DataFrame property that displays the data types of the columns in the specified DataFrame.
 

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

  The `pandas` DataFrame `dtypes` property will be helpful.

</details>

### Modify the data types

Notice that the data type of the `Date Joined` column is an `object`&mdash;in this case, a string. Convert this column to `datetime` to make it more usable. 

In [7]:
# Apply necessary datatype conversions.

### YOUR CODE HERE ###
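
A minimal sketch of the conversion on a two-row stand-in for the dataset (the `sample` frame is illustrative). It assumes the date strings are ISO-formatted, as in the rows displayed earlier, so `pd.to_datetime` can parse them without an explicit `format`.

```python
import pandas as pd

# Hypothetical two-row stand-in for the `Date Joined` column
sample = pd.DataFrame({'Date Joined': ['2017-04-07', '2012-12-01']})

# Parse the strings into datetime values
sample['Date Joined'] = pd.to_datetime(sample['Date Joined'])

# The column now exposes the .dt accessor (year, month, day, ...)
print(sample.dtypes)
print(sample['Date Joined'].dt.year.tolist())
```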


### Create a new column

Add a column called `Years To Unicorn`, which is the number of years between when the company was founded and when it became a unicorn.

In [8]:
try:
    df.insert(3, 'Year', df.Date.dt.year)
except ValueError:  # raised by insert() when the column already exists
    print('Year already exists.')

In [9]:
try:
    df.insert(4, 'Month', df.Date.dt.month)
except ValueError:  # raised by insert() when the column already exists
    print('Month already exists.')

In [10]:
try:
    df.insert(5, 'Day', df.Date.dt.day)
except ValueError:  # raised by insert() when the column already exists
    print('Day already exists.')

In [11]:
df.head()

Unnamed: 0,Company,Valuation,Date,Year,Month,Day,Industry,City,Country,Continent,Founded,Funding,Investors
0,Bytedance,180,2017-04-07,2017,4,7,Artificial intelligence,Beijing,China,Asia,2012,8.0,"Sequoia Capital China, SIG Asia Investments, S..."
1,SpaceX,100,2012-12-01,2012,12,1,Other,Hawthorne,United States,North America,2002,7.0,"Founders Fund, Draper Fisher Jurvetson, Rothen..."
2,SHEIN,100,2018-07-03,2018,7,3,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,2.0,"Tiger Global Management, Sequoia Capital China..."
3,Stripe,95,2014-01-23,2014,1,23,FinTech,San Francisco,United States,North America,2010,2.0,"Khosla Ventures, LowercaseCapital, capitalG"
4,Klarna,46,2011-12-12,2011,12,12,Fintech,Stockholm,Sweden,Europe,2005,4.0,"Institutional Venture Partners, Sequoia Capita..."


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Extract just the year from the `Date Joined` column. 

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  Use `dt.year` to access the year of a datetime object.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

Subtract the `Year Founded` from the `Date Joined`, and save it to a new column called `Years To Unicorn`.
  
Ensure you're properly extracting just the year (as an integer) from `Date Joined`.

</details>
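
The hints above can be sketched as follows on a small stand-in frame. The original column names `Date Joined` and `Year Founded` are assumed here; the cells in this notebook use the renamed `Date` and `Founded` columns instead.

```python
import pandas as pd

# Hypothetical stand-in rows using the original column names
sample = pd.DataFrame({
    'Company': ['Bytedance', 'SpaceX'],
    'Date Joined': pd.to_datetime(['2017-04-07', '2012-12-01']),
    'Year Founded': [2012, 2002],
})

# Extract the integer year from the datetime column, then subtract the founding year
sample['Years To Unicorn'] = sample['Date Joined'].dt.year - sample['Year Founded']
print(sample[['Company', 'Years To Unicorn']])
```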

**Question: Why might your client be interested in how quickly a company achieved unicorn status?**

[Write your response here. Double-click (or enter) to edit.]

### Input validation

The data has three kinds of issues: invalid values, duplicate rows, and inconsistent `Industry` labels.

Identify and correct each of these issues.

#### Correcting bad data

Get descriptive statistics for the `Years To Unicorn` column.

In [12]:
df.head(1)

Unnamed: 0,Company,Valuation,Date,Year,Month,Day,Industry,City,Country,Continent,Founded,Funding,Investors
0,Bytedance,180,2017-04-07,2017,4,7,Artificial intelligence,Beijing,China,Asia,2012,8.0,"Sequoia Capital China, SIG Asia Investments, S..."


In [13]:
#companies['Years To Unicorn'] = 
(df['Year'] - df['Founded']).describe()

count    1062.000000
mean        7.001883
std         5.321696
min        -3.000000
25%         4.000000
50%         6.000000
75%         9.000000
max        98.000000
dtype: float64

In [14]:
df[df.Year < df.Founded]

Unnamed: 0,Company,Valuation,Date,Year,Month,Day,Industry,City,Country,Continent,Founded,Funding,Investors
525,InVision,2,2017-11-01,2017,11,1,Internet software & services,New York,United States,North America,2020,349.0,"FirstMark Capital, Tiger Global Management, IC..."


In [15]:
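# InVision's founding year (2020) is later than the year it became a unicorn (2017); overwrite it with 2011
df.loc[df.Year < df.Founded, 'Founded'] = 2011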

In [16]:
# Identify and correct the issue with Years To Unicorn.
(df['Year'] - df['Founded']).describe()

count    1062.000000
mean        7.010358
std         5.312912
min         0.000000
25%         4.000000
50%         6.000000
75%         9.000000
max        98.000000
dtype: float64

In [17]:
df[df.Year < df.Founded]

Unnamed: 0,Company,Valuation,Date,Year,Month,Day,Industry,City,Country,Continent,Founded,Funding,Investors


Now, recalculate all the values in the `Years To Unicorn` column to remove the negative value for InVision. Verify that there are no more negative values afterwards.

In [18]:
df.insert(11, 'Unicorn', df.Year-df.Founded)
df.head(2)

Unnamed: 0,Company,Valuation,Date,Year,Month,Day,Industry,City,Country,Continent,Founded,Unicorn,Funding,Investors
0,Bytedance,180,2017-04-07,2017,4,7,Artificial intelligence,Beijing,China,Asia,2012,5,8.0,"Sequoia Capital China, SIG Asia Investments, S..."
1,SpaceX,100,2012-12-01,2012,12,1,Other,Hawthorne,United States,North America,2002,10,7.0,"Founders Fund, Draper Fisher Jurvetson, Rothen..."
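
The correct-then-verify pattern used in this step can be sketched on a tiny frame. The rows below are illustrative stand-ins, and the replacement year 2011 is taken from the cell above.

```python
import pandas as pd

# Hypothetical rows: the first has a founding year later than its unicorn year
sample = pd.DataFrame({
    'Company': ['InVision', 'Stripe'],
    'Year': [2017, 2014],
    'Founded': [2020, 2010],
})

# Flag impossible rows, then overwrite the founding year for those rows only
invalid = sample['Year'] < sample['Founded']
sample.loc[invalid, 'Founded'] = 2011

# Recalculate the derived column and confirm no negative values remain
sample['Years To Unicorn'] = sample['Year'] - sample['Founded']
assert (sample['Years To Unicorn'] >= 0).all()
print(sample)
```

Recomputing the derived column after the fix matters, because a column calculated before the correction would still hold the old negative value.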


#### Issues with `Industry` labels

The company provided you with the following list of industry labels to identify in the data for `Industry`. 

**Note:** Any labels in the `Industry` column that are not in `industry_list` are misspellings.

In [19]:
# List provided by the company of the expected industry labels in the data
industry_list = ['Artificial intelligence', 'Other','E-commerce & direct-to-consumer', 'Fintech',\
       'Internet software & services','Supply chain, logistics, & delivery', 'Consumer & retail',\
       'Data management & analytics', 'Edtech', 'Health', 'Hardware','Auto & transportation', \
        'Travel', 'Cybersecurity','Mobile & telecommunications']
len(industry_list)

15

**Question: Which values currently exist in the `Industry` column that are not in `industry_list`?**


In [20]:
# Check which values are in `Industry` but not in `industry_list`
df.Industry.unique()

array(['Artificial intelligence', 'Other',
       'E-commerce & direct-to-consumer', 'FinTech', 'Fintech',
       'Internet software & services',
       'Supply chain, logistics, & delivery', 'Consumer & retail',
       'Data management and analytics', 'Edtech', 'Health', 'Hardware',
       'Auto & transportation', 'Travel', 'Cybersecurity',
       'Mobile & telecommunications', 'Data management & analytics',
       'Artificial Intelligence'], dtype=object)

In [21]:
[i for i in df.Industry.unique() if i not in industry_list]

['FinTech', 'Data management and analytics', 'Artificial Intelligence']

In [22]:
[i for i in industry_list if i not in df.Industry.unique()]

[]

In [23]:
set(df['Industry']) - set(industry_list)

{'Artificial Intelligence', 'Data management and analytics', 'FinTech'}

In [24]:
industry_map = {'FinTech':'Fintech', 
                'Data management and analytics':'Data management & analytics', 
                'Artificial Intelligence':'Artificial intelligence'}
industry_map

{'FinTech': 'Fintech',
 'Data management and analytics': 'Data management & analytics',
 'Artificial Intelligence': 'Artificial intelligence'}

In [25]:
df.Industry = df.Industry.replace(industry_map)

In [26]:
set(df['Industry']) - set(industry_list)

set()
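
As a complement to the set difference, `Series.isin` gives a row-level view of the same check and shows how many rows each unexpected label affects. This is a sketch on shortened, illustrative data; the three-item `industry_list` below is a subset of the list provided by the company.

```python
import pandas as pd

# Shortened, illustrative version of the expected labels
industry_list = ['Artificial intelligence', 'Fintech', 'Data management & analytics']

# Hypothetical column containing the three misspellings found above
industry = pd.Series(['Fintech', 'FinTech', 'Artificial Intelligence',
                      'Fintech', 'Data management and analytics'])

# Count the rows carrying each label that is not in the expected list
unexpected = industry[~industry.isin(industry_list)].value_counts()
print(unexpected)

# Replace the misspellings, then verify every label is now expected
industry_map = {'FinTech': 'Fintech',
                'Data management and analytics': 'Data management & analytics',
                'Artificial Intelligence': 'Artificial intelligence'}
cleaned = industry.replace(industry_map)
assert cleaned.isin(industry_list).all()
```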

#### Handling duplicate rows

The business mentioned that no company should appear in the data more than once.

Verify that this is indeed the case, and if not, clean the data so each company appears only once.

Begin by checking which, if any, companies are duplicated. Filter the data to return all occurrences of those duplicated companies.

In [27]:
# Isolate rows of all companies that have duplicates
df[df.duplicated(subset=['Company'], keep=False)]

Unnamed: 0,Company,Valuation,Date,Year,Month,Day,Industry,City,Country,Continent,Founded,Unicorn,Funding,Investors
384,BrewDog,2,2017-04-10,2017,4,10,Consumer & retail,Aberdeen,United Kingdom,Europe,2007,10,233.0,"TSG Consumer Partners, Crowdcube"
385,BrewDog,2,2017-04-10,2017,4,10,Consumer & retail,Aberdeen,UnitedKingdom,Europe,2007,10,233.0,TSG Consumer Partners
508,ZocDoc,2,2015-08-20,2015,8,20,Health,New York,United States,North America,2007,8,374.0,"Founders Fund, Khosla Ventures, Goldman Sachs"
509,ZocDoc,2,2015-08-20,2015,8,20,Health,,United States,North America,2007,8,374.0,Founders Fund
1019,SoundHound,1,2018-05-03,2018,5,3,Artificial intelligence,Santa Clara,United States,North America,2005,13,215.0,"Tencent Holdings, Walden Venture Capital, Glob..."
1020,SoundHound,1,2018-05-03,2018,5,3,Other,Santa Clara,United States,North America,2005,13,215.0,Tencent Holdings


In [28]:
# Preview the later occurrences that keep='first' flags as duplicates (this cell displays them but does not drop them)
df[df.duplicated(subset=['Company'], keep='first')]

Unnamed: 0,Company,Valuation,Date,Year,Month,Day,Industry,City,Country,Continent,Founded,Unicorn,Funding,Investors
385,BrewDog,2,2017-04-10,2017,4,10,Consumer & retail,Aberdeen,UnitedKingdom,Europe,2007,10,233.0,TSG Consumer Partners
509,ZocDoc,2,2015-08-20,2015,8,20,Health,,United States,North America,2007,8,374.0,Founders Fund
1020,SoundHound,1,2018-05-03,2018,5,3,Other,Santa Clara,United States,North America,2005,13,215.0,Tencent Holdings


In [29]:
df.reset_index(drop=True, inplace=True)
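
Note that `duplicated()` only flags rows; on its own it does not remove them. A minimal sketch of removing them with `drop_duplicates`, using illustrative rows modeled on the BrewDog and ZocDoc duplicates above:

```python
import pandas as pd

# Hypothetical rows: BrewDog appears twice, with a misspelled country in the second row
sample = pd.DataFrame({
    'Company': ['BrewDog', 'BrewDog', 'ZocDoc', 'Stripe'],
    'Country': ['United Kingdom', 'UnitedKingdom', 'United States', 'United States'],
})

# Keep the first occurrence of each company and renumber the index
deduped = sample.drop_duplicates(subset=['Company'], keep='first').reset_index(drop=True)

# Verify that each company now appears exactly once
assert deduped['Company'].is_unique
print(deduped)
```

Keeping the first occurrence is a judgment call. In the rows displayed above, the first occurrence of each duplicated company is the more complete one, which is why `keep='first'` is assumed here.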

### Convert numerical data to categorical data

Sometimes, you'll want to simplify a numeric column by converting it to a categorical column. To do this, one common approach is to break the range of possible values into a defined number of equally sized bins and assign each bin a name. In the next step, you'll practice this process.

#### Create a `High Valuation` column

The data in the `Valuation` column represents how much money (in billions, USD) each company is valued at. Use the `Valuation` column to create a new column called `High Valuation`. For each company, the value in this column should be `low` if the company is in the bottom 50% of company valuations and `high` if the company is in the top 50%.

In [30]:
df.head(2)

Unnamed: 0,Company,Valuation,Date,Year,Month,Day,Industry,City,Country,Continent,Founded,Unicorn,Funding,Investors
0,Bytedance,180,2017-04-07,2017,4,7,Artificial intelligence,Beijing,China,Asia,2012,5,8.0,"Sequoia Capital China, SIG Asia Investments, S..."
1,SpaceX,100,2012-12-01,2012,12,1,Other,Hawthorne,United States,North America,2002,10,7.0,"Founders Fund, Draper Fisher Jurvetson, Rothen..."


In [31]:
# Create new `High Valuation` column
# Use qcut to divide Valuation into 'high' and 'low' Valuation groups
pd.qcut(companies['Valuation'], 2, labels = ['low', 'high'])

0       high
1       high
2       high
3       high
4       high
        ... 
1069     low
1070     low
1071     low
1072     low
1073     low
Name: Valuation, Length: 1074, dtype: category
Categories (2, object): ['low' < 'high']
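
The `qcut` call above returns a new Series without storing it. A sketch of saving the result as a `High Valuation` column, on eight illustrative valuations:

```python
import pandas as pd

# Hypothetical valuations in billions of USD
valuations = pd.DataFrame({'Valuation': [180, 100, 95, 46, 2, 1, 1, 3]})

# Split at the median: the lower half is labeled 'low', the upper half 'high'
valuations['High Valuation'] = pd.qcut(valuations['Valuation'], 2, labels=['low', 'high'])
print(valuations)
```

Because `qcut` places equal values in the same bin, the two groups can be unequal in size when many companies share the median valuation.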

In [32]:
group_names = ['xlo','low','med','hi','xhi']
bins = np.linspace(min(df["Valuation"]), max(df["Valuation"]), len(group_names)+1)
pd.cut(df['Valuation'], bins, labels=group_names, include_lowest=True).head(2)

0    xhi
1    med
Name: Valuation, dtype: category
Categories (5, object): ['xlo' < 'low' < 'med' < 'hi' < 'xhi']
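
Unlike `qcut`, which balances the number of rows per bin, `cut` with `np.linspace` edges produces bins of equal width. The sketch below uses four illustrative valuations spanning the same range as the dataset (1 to 180) to show where the edges fall.

```python
import numpy as np
import pandas as pd

# Hypothetical valuations covering the dataset's range of 1 to 180
valuation = pd.Series([180, 100, 46, 1])

group_names = ['xlo', 'low', 'med', 'hi', 'xhi']

# Six evenly spaced edges define five equal-width bins
bins = np.linspace(valuation.min(), valuation.max(), len(group_names) + 1)
print(bins)

# include_lowest=True makes the first bin include the minimum value
binned = pd.cut(valuation, bins, labels=group_names, include_lowest=True)
print(binned.tolist())
```

With a skewed variable such as valuation, equal-width bins tend to leave most rows in the lowest bin, which is worth checking with `value_counts()` before relying on the labels.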

In [33]:
df.insert(2, 'ValBin', pd.cut(df['Valuation'], bins, labels=group_names, include_lowest=True))

In [34]:
df.head(2)

Unnamed: 0,Company,Valuation,ValBin,Date,Year,Month,Day,Industry,City,Country,Continent,Founded,Unicorn,Funding,Investors
0,Bytedance,180,xhi,2017-04-07,2017,4,7,Artificial intelligence,Beijing,China,Asia,2012,5,8.0,"Sequoia Capital China, SIG Asia Investments, S..."
1,SpaceX,100,med,2012-12-01,2012,12,1,Other,Hawthorne,United States,North America,2002,10,7.0,"Founders Fund, Draper Fisher Jurvetson, Rothen..."


In [35]:
df.insert(3, 'ValNumBin', pd.cut(df['Valuation'], bins, labels=range(1,6), include_lowest=True))

In [36]:
df.head(2)

Unnamed: 0,Company,Valuation,ValBin,ValNumBin,Date,Year,Month,Day,Industry,City,Country,Continent,Founded,Unicorn,Funding,Investors
0,Bytedance,180,xhi,5,2017-04-07,2017,4,7,Artificial intelligence,Beijing,China,Asia,2012,5,8.0,"Sequoia Capital China, SIG Asia Investments, S..."
1,SpaceX,100,med,3,2012-12-01,2012,12,1,Other,Hawthorne,United States,North America,2002,10,7.0,"Founders Fund, Draper Fisher Jurvetson, Rothen..."


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

There are multiple ways to complete this task. Review what you've learned about organizing data into equal quantiles.
    
</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

Consider using the pandas [`qcut()`](https://pandas.pydata.org/docs/reference/api/pandas.qcut.html) function. 
    
</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

Use `pandas` `qcut()` to divide the data into two equal-sized quantile buckets. Use the `labels` parameter to define the output labels. The values you give for `labels` will be the values that are inserted into the new column. 
    
</details>

### Convert categorical data to numerical data

Three common methods for changing categorical data to numerical are:

1. Label encoding: order matters (ordinal numeric labels)
2. Label encoding: order doesn't matter (nominal numeric labels)
3. Dummy encoding: order doesn't matter (creation of binary columns for each possible category contained in the variable)

The decision on which method to use depends on the context and must be made on a case-by-case basis. However, a distinction is typically made between categorical variables with equal weight given to all possible categories vs. variables with a hierarchical structure of importance to their possible categories.  

For example, a variable called `subject` might have possible values of `history`, `mathematics`, `literature`. In this case, each subject might be **nominal**&mdash;given the same level of importance. However, you might have another variable called `class`, whose possible values are `freshman`, `sophomore`, `junior`, `senior`. In this case, the class variable is **ordinal**&mdash;its values have an ordered, hierarchical structure of importance. 

Machine learning models typically need all data to be numeric, and they generally use ordinal label encoding (method 1) and dummy encoding (method 3). 
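
The three methods can be sketched side by side using the `subject` and `class` examples from the text. The `students` frame and the new column names are illustrative.

```python
import pandas as pd

# Hypothetical data for the `subject` (nominal) and `class` (ordinal) examples
students = pd.DataFrame({
    'subject': ['history', 'mathematics', 'literature', 'history'],
    'class': ['freshman', 'senior', 'junior', 'sophomore'],
})

# 1. Ordinal label encoding: codes follow an explicitly stated order
class_order = ['freshman', 'sophomore', 'junior', 'senior']
students['class_code'] = pd.Categorical(
    students['class'], categories=class_order, ordered=True
).codes

# 2. Nominal label encoding: codes follow alphabetical order and carry no meaning
students['subject_code'] = students['subject'].astype('category').cat.codes

# 3. Dummy encoding: one binary column per category
subject_dummies = pd.get_dummies(students['subject'])

print(students)
print(subject_dummies)
```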

In the next steps, you'll convert the following variables: `Continent`, `Country/Region`, and `Industry`, each using a different approach.

### Convert `Continent` to numeric

For the purposes of this exercise, suppose that the investment group has specified that they want to give more weight to continents with fewer unicorn companies because they believe this could indicate unrealized market potential. 

**Question: Which type of variable would this make the `Continent` variable in terms of how it would be converted to a numeric data type?**


[Write your response here. Double-click (or enter) to edit.]

Rank the continents in descending order from the greatest number of unicorn companies to the least.

In [37]:
# Rank the continents by number of unicorn companies
c_list = df.groupby('Continent').size().sort_values(ascending=False).keys().to_list()

In [38]:
c_list

['North America', 'Asia', 'Europe', 'South America', 'Oceania', 'Africa']

In [39]:
df.groupby('Continent').size().sort_values(ascending=False).keys()

Index(['North America', 'Asia', 'Europe', 'South America', 'Oceania',
       'Africa'],
      dtype='object', name='Continent')

In [40]:
c_map = dict([(n,i+1) for i,n in enumerate(df.groupby('Continent').size().sort_values(ascending=False).keys())])
c_map

{'North America': 1,
 'Asia': 2,
 'Europe': 3,
 'South America': 4,
 'Oceania': 5,
 'Africa': 6}

In [41]:
try:
    df.insert(12, 'ContNum', df.Continent.apply(lambda c: c_map[c]))
except ValueError:  # raised by insert() when the column already exists
    print('Does ContNum column already exist?')

In [42]:
df.head(2)

Unnamed: 0,Company,Valuation,ValBin,ValNumBin,Date,Year,Month,Day,Industry,City,Country,Continent,ContNum,Founded,Unicorn,Funding,Investors
0,Bytedance,180,xhi,5,2017-04-07,2017,4,7,Artificial intelligence,Beijing,China,Asia,2,2012,5,8.0,"Sequoia Capital China, SIG Asia Investments, S..."
1,SpaceX,100,med,3,2012-12-01,2012,12,1,Other,Hawthorne,United States,North America,1,2002,10,7.0,"Founders Fund, Draper Fisher Jurvetson, Rothen..."
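
An equivalent way to build and apply the rank mapping is sketched below with `value_counts` and `Series.map`. The six-value `continents` Series is illustrative and has no ties in its counts.

```python
import pandas as pd

# Hypothetical continent column
continents = pd.Series(['North America', 'Asia', 'North America',
                        'Europe', 'Asia', 'North America'])

# value_counts() sorts by count in descending order, so its index is already ranked
ranked = continents.value_counts().index

# Rank 1 goes to the continent with the most companies
continent_map = {name: rank for rank, name in enumerate(ranked, start=1)}

# map() looks each value up in the dictionary
continent_num = continents.map(continent_map)
print(continent_map)
print(continent_num.tolist())
```

One difference from the `apply` with a lambda used above: `Series.map` returns `NaN` for a value missing from the dictionary instead of raising a `KeyError`, so a follow-up check for missing values may be worthwhile.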


### Convert `Country/Region` to numeric

Now, suppose that within a given continent, each company's `Country/Region` is given equal importance. For analytical purposes, you want to convert the values in this column to numeric without creating a large number of dummy columns. Use label encoding of this nominal categorical variable to create a new column called `Country/Region Numeric`, wherein each unique `Country/Region` is assigned its own number. 

In [43]:
# Create `Country/Region Numeric` column
# Create numeric categories for Country/Region
df.groupby(['ContNum','Country']).agg(Size=('Country','size')).sort_values(['ContNum','Size'], ascending=[True, False])

ContNum,Country,Size
1,United States,554
1,Canada,18
1,Mexico,6
1,Bahamas,1
1,Bermuda,1
2,China,170
2,India,65
2,Israel,20
2,Singapore,12
2,South Korea,12


In [44]:
df.head(2)

Unnamed: 0,Company,Valuation,ValBin,ValNumBin,Date,Year,Month,Day,Industry,City,Country,Continent,ContNum,Founded,Unicorn,Funding,Investors
0,Bytedance,180,xhi,5,2017-04-07,2017,4,7,Artificial intelligence,Beijing,China,Asia,2,2012,5,8.0,"Sequoia Capital China, SIG Asia Investments, S..."
1,SpaceX,100,med,3,2012-12-01,2012,12,1,Other,Hawthorne,United States,North America,1,2002,10,7.0,"Founders Fund, Draper Fisher Jurvetson, Rothen..."


In [45]:
df['Country'].astype('category').cat.codes

0        9
1       44
2        9
3       44
4       38
        ..
1057     9
1058     9
1059     9
1060    43
1061    44
Length: 1062, dtype: int8
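
The integers returned by `.cat.codes` are positions in the alphabetically sorted list of categories. A sketch of recovering the code-to-label lookup, on four illustrative rows:

```python
import pandas as pd

# Hypothetical country column
country = pd.Series(['China', 'United States', 'China', 'Sweden'])

# Convert to a categorical; categories are sorted alphabetically by default
as_category = country.astype('category')
codes = as_category.cat.codes

# Build a lookup from each integer code back to its label
code_to_country = dict(enumerate(as_category.cat.categories))
print(codes.tolist())
print(code_to_country)
```

Keeping this lookup makes the encoded column interpretable later, since the codes themselves carry no ordering or magnitude.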

In [46]:
c_list = df.groupby('Country').size().sort_values(ascending=False).keys().to_list()
c_list[:3]

['United States', 'China', 'India']

In [47]:
country_map = dict([(n,i+1) for i,n in enumerate(df.groupby('Country').size().sort_values(ascending=False).keys())])
country_map

{'United States': 1,
 'China': 2,
 'India': 3,
 'United Kingdom': 4,
 'Germany': 5,
 'France': 6,
 'Israel': 7,
 'Canada': 8,
 'Brazil': 9,
 'Singapore': 10,
 'South Korea': 11,
 'Australia': 12,
 'Hong Kong': 13,
 'Indonesia': 14,
 'Sweden': 15,
 'Netherlands': 16,
 'Mexico': 17,
 'Switzerland': 18,
 'Ireland': 19,
 'Japan': 20,
 'Finland': 21,
 'Norway': 22,
 'United Arab Emirates': 23,
 'Belgium': 24,
 'Spain': 25,
 'Turkey': 26,
 'Philippines': 27,
 'Vietnam': 28,
 'Thailand': 29,
 'Estonia': 30,
 'Denmark': 31,
 'Colombia': 32,
 'Chile': 33,
 'Austria': 34,
 'South Africa': 35,
 'UnitedKingdom': 36,
 'Argentina': 37,
 'Senegal': 38,
 'Nigeria': 39,
 'Malaysia': 40,
 'Luxembourg': 41,
 'Lithuania': 42,
 'Czech Republic': 43,
 'Croatia': 44,
 'Bermuda': 45,
 'Bahamas': 46,
 'Italy': 47}

In [48]:
try:
    df.insert(11, 'CountryNum', df.Country.apply(lambda c: country_map[c]))
except ValueError:  # raised by insert() when the column already exists
    print('Does CountryNum column already exist?')
df.head(2)

Unnamed: 0,Company,Valuation,ValBin,ValNumBin,Date,Year,Month,Day,Industry,City,Country,CountryNum,Continent,ContNum,Founded,Unicorn,Funding,Investors
0,Bytedance,180,xhi,5,2017-04-07,2017,4,7,Artificial intelligence,Beijing,China,2,Asia,2,2012,5,8.0,"Sequoia Capital China, SIG Asia Investments, S..."
1,SpaceX,100,med,3,2012-12-01,2012,12,1,Other,Hawthorne,United States,1,North America,1,2002,10,7.0,"Founders Fund, Draper Fisher Jurvetson, Rothen..."


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Review what you have learned about converting a variable with a string/object data type to a category.
    
</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

To use label encoding, apply `.astype('category').cat.codes` to the `Country/Region` in `companies`.
    
</details>

### Convert `Industry` to numeric

Finally, create dummy variables for the values in the `Industry` column. 

In [49]:
pd.get_dummies(df['Industry'])

Unnamed: 0,Artificial intelligence,Auto & transportation,Consumer & retail,Cybersecurity,Data management & analytics,E-commerce & direct-to-consumer,Edtech,Fintech,Hardware,Health,Internet software & services,Mobile & telecommunications,Other,"Supply chain, logistics, & delivery",Travel
0,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
2,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1057,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
1058,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
1059,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False
1060,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False


In [50]:
# Convert `Industry` to numeric data
# Create dummy variables with Industry values
industry_encoded = pd.get_dummies(df['Industry'])

# Combine `companies` DataFrame with new dummy Industry columns
df = pd.concat([df, industry_encoded], axis=1)

Display the first few rows of `companies`.

In [51]:
df.head()

Unnamed: 0,Company,Valuation,ValBin,ValNumBin,Date,Year,Month,Day,Industry,City,Country,CountryNum,Continent,ContNum,Founded,Unicorn,Funding,Investors,Artificial intelligence,Auto & transportation,Consumer & retail,Cybersecurity,Data management & analytics,E-commerce & direct-to-consumer,Edtech,Fintech,Hardware,Health,Internet software & services,Mobile & telecommunications,Other,"Supply chain, logistics, & delivery",Travel
0,Bytedance,180,xhi,5,2017-04-07,2017,4,7,Artificial intelligence,Beijing,China,2,Asia,2,2012,5,8.0,"Sequoia Capital China, SIG Asia Investments, S...",True,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,SpaceX,100,med,3,2012-12-01,2012,12,1,Other,Hawthorne,United States,1,North America,1,2002,10,7.0,"Founders Fund, Draper Fisher Jurvetson, Rothen...",False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
2,SHEIN,100,med,3,2018-07-03,2018,7,3,E-commerce & direct-to-consumer,Shenzhen,China,2,Asia,2,2008,10,2.0,"Tiger Global Management, Sequoia Capital China...",False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
3,Stripe,95,med,3,2014-01-23,2014,1,23,Fintech,San Francisco,United States,1,North America,1,2010,4,2.0,"Khosla Ventures, LowercaseCapital, capitalG",False,False,False,False,False,False,False,True,False,False,False,False,False,False,False
4,Klarna,46,low,2,2011-12-12,2011,12,12,Fintech,Stockholm,Sweden,15,Europe,3,2005,6,4.0,"Institutional Venture Partners, Sequoia Capita...",False,False,False,False,False,False,False,True,False,False,False,False,False,False,False


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Consider using `pd.get_dummies` on the specified column. 
    
</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

When you call `pd.get_dummies()` on a specified series, it will return a dataframe consisting of each possible category contained in the series represented as its own binary column. You'll then have to combine this new dataframe of binary columns with the existing `companies` dataframe.
    
</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

You can use `pd.concat([col_a, col_b])` to combine the two dataframes. Remember to specify the correct axis of concatenation and to reassign the result back to the `companies` dataframe.
    
</details>
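
A variation on the hints above, sketched on three illustrative rows: the `prefix` argument marks which columns came from `Industry`, and `dtype=int` yields 0/1 values instead of `True`/`False`. Both arguments are optional choices rather than requirements of the lab.

```python
import pandas as pd

# Hypothetical stand-in rows
sample = pd.DataFrame({
    'Company': ['Bytedance', 'Stripe', 'Klarna'],
    'Industry': ['Artificial intelligence', 'Fintech', 'Fintech'],
})

# One 0/1 column per industry, each named 'Industry_<label>'
industry_encoded = pd.get_dummies(sample['Industry'], prefix='Industry', dtype=int)

# axis=1 appends the dummy columns side by side with the original columns
sample = pd.concat([sample, industry_encoded], axis=1)
print(sample)
```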

**Question: Which categorical encoding approach did you use for each variable? Why?**

[Write your response here. Double-click (or enter) to edit.]

**Question: How does label encoding change the data?**


[Write your response here. Double-click (or enter) to edit.]

**Question: What are the benefits of label encoding?**


[Write your response here. Double-click (or enter) to edit.]

**Question: What are the disadvantages of label encoding?**


[Write your response here. Double-click (or enter) to edit.]

## Conclusion

**What are some key takeaways that you learned during this lab?**

[Write your response here. Double-click (or enter) to edit.]

**Reference**

[Bhat, M.A. *Unicorn Companies*](https://www.kaggle.com/datasets/mysarahmadbhat/unicorn-companies)



**Congratulations!** You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged.