Importing the pandas and NumPy libraries for data exploration, along with scikit-learn's LabelEncoder for encoding categorical columns later on.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import LabelEncoder
import os
print(os.listdir("../input"))
# Any results you write to the current directory are saved as output.

Loading the CSV file into memory as a pandas DataFrame.
Printing the first 5 rows to check the data.

In [None]:
dataframe_companies = pd.read_csv("../input/companies_sorted.csv")
dataframe_companies.head()

Let's perform the subset selection. Select the columns that we care about and ignore the rest of the data.

In [None]:
dataframe_companies = dataframe_companies[['name','year founded','industry','size range','country','current employee estimate']]
dataframe_companies.head()
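
As an alternative to loading everything and then selecting columns, the selection can be pushed into the read itself. The sketch below assumes a tiny in-memory CSV with made-up rows and a hypothetical `unused_column`; only the `usecols` parameter of `pd.read_csv` is the point here.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the real CSV file (rows are made up).
csv_text = io.StringIO(
    "name,unused_column,year founded,industry,size range,country,current employee estimate\n"
    "acme,x,1999,retail,1 - 10,united states,7\n"
    "globex,y,2005,software,11 - 50,india,30\n"
)

wanted_columns = ["name", "year founded", "industry", "size range",
                  "country", "current employee estimate"]

# usecols makes read_csv parse only the listed columns.
sample = pd.read_csv(csv_text, usecols=wanted_columns)
print(sample.columns.tolist())
```

On a large file this can reduce memory use, because the unwanted columns are never materialised.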

Printing the info of the DataFrame to check what type of data each column holds.
Each column whose dtype is object potentially holds categorical data.

In [None]:
print(dataframe_companies.info())
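
One possible way to follow up on the dtype listing is to pick out the non-numeric columns programmatically and count their distinct values. This is a minimal sketch on a made-up three-row frame, not the real dataset.

```python
import pandas as pd

# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    "name": ["acme", "globex", "initech"],
    "year founded": [1999.0, 2005.0, None],
    "industry": ["retail", "software", "software"],
})

# Columns that are not numeric are the candidates for categorical data.
non_numeric_columns = [
    column for column in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[column])
]

# Few distinct values relative to the row count hints at a categorical column.
unique_counts = sample[non_numeric_columns].nunique()
print(unique_counts)
```

A column such as name, where nearly every value is distinct, is text rather than a category, even though it is not numeric.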

Drawing a box plot to check the relationship between a categorical feature (size range) and a continuous feature (year founded).

In [None]:
dataframe_companies.boxplot(column='year founded', by='size range', figsize=(10, 10))
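
The numbers behind a box plot can also be inspected directly. The sketch below, on made-up values, computes the quartiles of the founding year per size range, which correspond to the box edges and the median line.

```python
import pandas as pd

# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    "size range": ["1 - 10", "1 - 10", "1 - 10", "11 - 50", "11 - 50"],
    "year founded": [2000, 2010, 2014, 1990, 2000],
})

# One row per size range, one column per quartile.
summary = (
    sample.groupby("size range")["year founded"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(summary)
```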

Printing the total number of null values in the dataset, then the column-wise distribution of null values.

In [None]:
print('Total null values in the dataset: ',dataframe_companies.isnull().values.sum())
print('Column wise distribution of null values in the dataset')
print(dataframe_companies.isnull().sum())
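
Raw counts are hard to judge without the row count, so it can help to look at the share of nulls per column as well. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    "name": ["acme", None, "initech", "umbrella"],
    "year founded": [1999.0, np.nan, np.nan, 2005.0],
})

# The mean of a boolean mask is the fraction of True values per column.
null_share = sample.isnull().mean()
print(null_share)
```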

Handling the missing or null values.

In [None]:
#Removing the rows where the company name is null, since they are useless without a name. There are 3 such rows.
#axis = 0 means that rows are dropped. With axis = 1, columns would be dropped instead.
#subset defines which columns to check for null values.
dataframe_companies = dataframe_companies.dropna(axis = 0,subset = ['name'])

#Cross checking if the null values were deleted properly.
print('Total null values in the dataset: ',dataframe_companies.isnull().values.sum())
print('Column wise distribution of null values in the dataset')
print(dataframe_companies.isnull().sum())

In [None]:
#Fill the null values in country and industry with the placeholder "missing".
#The result is assigned back because fillna(inplace = True) on a column selection may not update the dataframe in newer pandas versions.
dataframe_companies['country'] = dataframe_companies['country'].fillna('missing')
dataframe_companies['industry'] = dataframe_companies['industry'].fillna('missing')

#Keep only those rows that have at least 3 non-null column values. Drop the rest.
dataframe_companies.dropna(thresh = 3, inplace = True)

#Cross checking 
print('Total null values in the dataset: ',dataframe_companies.isnull().values.sum())
print('Column wise distribution of null values in the dataset')
print(dataframe_companies.isnull().sum())
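
The `thresh` argument is easy to misread: it is the minimum number of non-null values a row must have to be kept, not the number of nulls that is tolerated. A small sketch on made-up rows:

```python
import numpy as np
import pandas as pd

# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    "name": ["acme", "globex", "initech"],
    "year founded": [1999.0, np.nan, np.nan],
    "industry": ["retail", "software", None],
    "country": ["india", "india", None],
})

# acme has 4 non-null values, globex has 3, initech has only 1.
kept = sample.dropna(thresh=3)
print(kept["name"].tolist())
```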

In [None]:
#Drawing a histogram for the year.
dataframe_companies.hist('year founded',bins = 10)

Printing the median of the founding year. From the histogram it is fairly clear that the median of the year founded field lies somewhere in the mid 2000s.

In [None]:
print(dataframe_companies['year founded'].median())

Filling in the missing values of the year founded field with the median.

In [None]:
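#Fill only the 'year founded' column. A frame-wide fillna would also write the median year into any other column that still has nulls.
dataframe_companies['year founded'] = dataframe_companies['year founded'].fillna(dataframe_companies['year founded'].median())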

#Cross checking 
print('Total null values in the dataset: ',dataframe_companies.isnull().values.sum())
print('Column wise distribution of null values in the dataset')
print(dataframe_companies.isnull().sum())
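
When filling with a column's median, it is worth restricting the fill to that column. The sketch below uses made-up values and shows that a column-level fill leaves gaps in other columns untouched, whereas calling `fillna` on the whole frame would put the median year into them too.

```python
import numpy as np
import pandas as pd

# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    "year founded": [1990.0, 2000.0, 2010.0, np.nan],
    "current employee estimate": [5.0, np.nan, 40.0, 12.0],
})

median_year = sample["year founded"].median()

# Fill only the year column and assign the result back.
filled = sample.copy()
filled["year founded"] = filled["year founded"].fillna(median_year)
print(filled)
```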

Converting the categorical data into numerical data.
Printing all the unique values of a categorical column to check which kind of data it holds:
* Nominal: no order associated
* Ordinal: some order associated
* Continuous: infinitely many values between any two values

In [None]:
dataframe_companies.industry.value_counts()
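
The nominal versus ordinal distinction can be expressed in pandas itself through `pd.Categorical` and its `ordered` flag. This is only a sketch; the size labels below are illustrative and do not come from the dataset.

```python
import pandas as pd

# Ordinal: the categories carry an order (labels here are illustrative).
sizes = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
)
print(sizes.codes)               # integer codes follow the declared order
print(sizes.min(), sizes.max())  # order-based operations are defined

# Nominal: no order is declared, so the categories are just distinct labels.
industries = pd.Categorical(["retail", "software", "retail"])
print(industries.ordered)
```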


Industry is a nominal data type, since there is no order associated with industries.
Let's use the scikit-learn LabelEncoder to convert the nominal data into numeric data by assigning each industry a unique number from 0 to N - 1 (here, 0 to 148).

In [None]:
labelEncoder = LabelEncoder()
industry_labels = labelEncoder.fit_transform(dataframe_companies['industry'])
industry_mappings = {index: label for index, label in enumerate(labelEncoder.classes_)}
print(industry_mappings)
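
Keeping the fitted encoder around makes the encoding reversible. The sketch below uses a made-up list of industries to show that `LabelEncoder` numbers the classes in sorted order and that `inverse_transform` recovers the original labels.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration only.
industries = ["software", "retail", "banking", "retail"]

encoder = LabelEncoder()
codes = encoder.fit_transform(industries)

# classes_ holds the sorted unique labels; the position is the code.
mapping = dict(enumerate(encoder.classes_))
print(mapping)

# inverse_transform maps the codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```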

Adding these labels to our dataset as a new column.

In [None]:
dataframe_companies['industry_mapping'] = LabelEncoder().fit_transform(dataframe_companies['industry'])
dataframe_companies.head()

Performing the same steps for the country field to convert its categorical data to numerical data.

In [None]:
dataframe_companies.country.value_counts()


In [None]:
dataframe_companies['country_mapping'] = LabelEncoder().fit_transform(dataframe_companies['country'])
dataframe_companies.head()

Renaming the column headers that contain spaces, so that the columns can also be accessed as attributes.

In [None]:
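#Only the columns are renamed. Passing index=str would additionally convert the row labels to strings, which is not needed here.
dataframe_companies.rename(columns={"size range": "size_range", "year founded": "year_founded", "current employee estimate": "current_employee_estimate"}, inplace = True)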
dataframe_companies.head()
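
When many headers contain spaces, listing each one by hand gets tedious. One possible shortcut, sketched here on an empty frame with the same header names, is to replace the spaces in all column names at once.

```python
import pandas as pd

# Empty frame that only carries the header names.
sample = pd.DataFrame(
    columns=["name", "year founded", "size range", "current employee estimate"]
)

# Replace every space in every column name with an underscore.
sample.columns = sample.columns.str.replace(" ", "_")
print(sample.columns.tolist())
```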


Checking how many unique values the size range contains

In [None]:
dataframe_companies.size_range.value_counts()

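Converting this categorical variable into labels with the label encoder. Note that LabelEncoder numbers the classes in sorted (alphabetical) order, which need not match the natural order of the size ranges.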

In [None]:
dataframe_companies['size_range_mapping'] = LabelEncoder().fit_transform(dataframe_companies['size_range'])
dataframe_companies.head()
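
Size range looks like an ordinal field, so the codes should ideally follow the bucket order. The sketch below assumes a list of bucket labels from smallest to largest (an assumption, to be checked against the `value_counts()` output above) and compares an explicit rank mapping with what `LabelEncoder` produces.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed bucket labels, listed from smallest to largest.
size_order = ["1 - 10", "11 - 50", "51 - 200", "201 - 500", "501 - 1000",
              "1001 - 5000", "5001 - 10000", "10001+"]

sizes = pd.Series(["11 - 50", "10001+", "1 - 10", "201 - 500"])

# LabelEncoder sorts the strings, so "10001+" lands before "11 - 50".
alphabetical_codes = LabelEncoder().fit_transform(sizes.tolist())
print(list(alphabetical_codes))

# An explicit mapping keeps the codes in bucket order.
rank_by_label = {label: rank for rank, label in enumerate(size_order)}
ordinal_codes = sizes.map(rank_by_label)
print(ordinal_codes.tolist())
```

With the explicit mapping, a larger code always means a larger company, which is not guaranteed by the alphabetical codes.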

Convert the year field from float to int.

In [None]:
dataframe_companies['year_founded'] = dataframe_companies['year_founded'].astype(np.int64)
print(dataframe_companies.info())
dataframe_companies.head()
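
The cast to `np.int64` works here only because the nulls were filled beforehand. As a sketch on made-up values, the snippet below shows that the plain cast fails while nulls remain, and that pandas' nullable `Int64` dtype is one way to keep integers and missing values side by side.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing year.
years = pd.Series([1999.0, np.nan, 2005.0])

# A plain integer cast cannot represent NaN and raises an error.
try:
    years.astype(np.int64)
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True
print(cast_failed)

# The nullable Int64 dtype stores the gap as <NA>.
nullable_years = years.astype("Int64")
print(nullable_years)
```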

We have now successfully converted all the fields to numerical data.
Since there is no order associated with countries and industries, we could later one-hot encode them instead.
We could also drop the textual categorical fields before saving the table.
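
A minimal sketch of what one-hot encoding could look like with `pd.get_dummies`, using made-up rows. The original text column is replaced by one indicator column per distinct value.

```python
import pandas as pd

# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    "name": ["acme", "globex", "initech"],
    "country": ["india", "united states", "india"],
})

# columns selects what to encode; prefix names the new indicator columns.
one_hot = pd.get_dummies(sample, columns=["country"], prefix="country")
print(one_hot)
```

For a field with many distinct values, such as industry or country in the full dataset, this produces one column per value, so the resulting table can become wide.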

Saving the dataset as CSV before any further modification, so that we have a CSV backup. Check the output tab on the left navigation bar to see the result, and click on the download option to download the dataset.

In [None]:
dataframe_companies.to_csv('Companies_Cleaned_Dataset.csv', sep=',', encoding='utf-8')
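
By default `to_csv` also writes the row index, which comes back as an extra unnamed column when the file is read again. The sketch below uses an in-memory buffer and made-up rows to show a round trip with `index=False`.

```python
import io
import pandas as pd

# Made-up sample rows for illustration only.
sample = pd.DataFrame({"name": ["acme", "globex"], "year_founded": [1999, 2005]})

# Write without the index, then read the text back.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```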
