# Analysis 1- Shreyas Bhatt

## Milestone 2

#### Task 3 : Define and refine your research questions

**To what extent does the revenue of a company impact its growth rate and other company dynamics?**

It is common for companies to experience rapid growth during their early stages of operation. It is also apparent, however, that this growth is temporary: once a company gains enough traction, growth generally slows down. I wish to understand and quantify this relationship by comparing the revenue of each company with its CAGR (compound annual growth rate). I also wish to understand which metrics, from employee count to where the company is based, seem to help large companies maintain growth.

My dataset features revenue entries for 2017 and for 2020, so going forward I may have to either combine these figures in some way or examine how each of them relates to the other variables. Currently I feel that merging them may introduce some inconsistency, while looking solely at 2020 would likely not be a fair representation of the companies given the economic climate.
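
For reference, here is a minimal sketch of how a compound annual growth rate can be computed from two revenue figures. It assumes the growth period is the three years from 2017 to 2020, which the dataset itself does not state; the helper `cagr_percent` and the toy values are made up for illustration and are not taken from FT1000.csv.

```python
import pandas as pd


def cagr_percent(start_value, end_value, years):
    """Compound annual growth rate, in percent, between two values `years` apart."""
    return ((end_value / start_value) ** (1 / years) - 1) * 100


# Made-up revenues for illustration only; not taken from FT1000.csv.
toy = pd.DataFrame({
    "Revenue2017": [1_000_000, 2_000_000],
    "Revenue2020": [8_000_000, 2_000_000],
})

# Assumption: three years separate the two revenue columns.
toy["CAGR_check"] = cagr_percent(toy["Revenue2017"], toy["Revenue2020"], years=3)
print(toy)
```

The first toy company grows eightfold in three years, which corresponds to doubling every year, while the second does not grow at all. Comparing such a recomputed column with the provided `CAGR` column would be one way to check how the two revenue columns relate to it.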

#### Task 2 : Load Dataset

In [None]:
# loading dataframe
import pandas as pd
pd.read_csv("../data/raw/FT1000.csv")

## Milestone 3 
### Task 1: Exploratory Data Analysis (EDA)

In [None]:
# Importing libraries for analysis and visualization (Pandas already imported)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Trying to understand the basic details of data
df = pd.read_csv("../data/raw/FT1000.csv")
df.info()
print('Shape of data is', df.shape) 

In [None]:
print('Unique Variables Below')
df.nunique(axis=0)

In [None]:
df.describe()

#### Observations (1/2)
From the above commands, there are a few important details I have noticed. 

- Firstly, some of the figures are a little strange. The lowest number of employees for one of the top 1000 European companies in 2020 was 1 person. This seems quite implausible, and even if such a company truly had 1 employee in 2020, I do not think it makes much sense to compare it with the other companies because of how different it must be. Some companies also list 0 employees in 2017, presumably because they did not exist yet. Therefore I will consider an employee count filter.

- I should rename some columns so that they are clearer and include units.

- The integer data types are fine but some data types could be better expressed as categories or strings etc.

- Otherwise the data looks reasonable at first glance: there are no null values and the shape of the data is as expected.
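
Before committing to an employee count filter, it may help to count how many rows it would remove and to confirm that no values are missing. The sketch below does this on a made-up frame; the rows and the threshold name `MIN_EMPLOYEES_2020` are hypothetical, and only the column names `Employees2017` and `Employees2020` come from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from FT1000.csv.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Employees2017": [0, 12, 5, 40],
    "Employees2020": [15, 1, 30, 90],
})

MIN_EMPLOYEES_2020 = 3  # hypothetical threshold, chosen arbitrarily

# Keep companies that had staff in 2017 and at least the minimum staff in 2020.
keep = (toy["Employees2017"] > 0) & (toy["Employees2020"] >= MIN_EMPLOYEES_2020)

print("Rows that would be dropped:", int((~keep).sum()))
print("Missing values per column:")
print(toy.isna().sum())

filtered = toy[keep]
```

Running the same mask against `df` would show how sensitive the row count is to the chosen threshold.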

Now the data can be visualized.

In [None]:
#Illustrating Data

# 1) HeatPlot
corr = df.corr(numeric_only=True)  # numeric_only expects a bool, not a list
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, annot=True, cmap=sns.diverging_palette(215, 15, as_cmap=True))

# 2) Histogram of everything
hist = df.hist(bins=10,figsize =(15,10), log=True)

# 3) Bar Graph illustrating ranked status in 2020 and sector
grouped_data = df.groupby(['Sector', 'Ranked2020']).size().unstack(fill_value=0)
grouped_data.plot(kind='bar', stacked=True, figsize=(15, 6))
plt.title('Sector and Rank Distribution')
plt.xlabel('Sector')
plt.ylabel('Frequency')
plt.xticks(rotation=90)
plt.show()

# 4) Graph showing relationship between revenue of company and CAGR growth rate
plt.scatter(df['Revenue2020'], df['CAGR'])
# Set the x-axis label and y-axis label
plt.title('Revenue of company in 2020 vs CAGR growth rate')
plt.xlabel('2020 Revenue ($)')
plt.ylabel('CAGR %')
plt.xscale('log')
plt.yscale('log')

#### Observations (2/2)  

- Firstly, the 2020 revenue of a company and its CAGR are only weakly correlated, with a coefficient of 0.13, which is also visible on the scatterplot. \
Therefore I want to pivot towards investigating other relationships that revenue does seem to take part in, such as the one between a company's 2017 revenue \
and its 2020 revenue, as well as looking at employee dynamics.
- I would say that the data is actually quite volatile, seeing as many of the companies that were included in the 2017 rankings were not included in the 2020 rankings, as illustrated. A lot of the correlations also seem to be fairly weak.
- The ranges of my data are also very wide. Revenue can vary by several orders of magnitude, so I had to use a logarithmic scale for analysis.

As for a further analysis plan, I may investigate ways of analysing my data on a logarithmic scale. It may be that a lot of relationships have a low \
linear correlation value but could in fact show a high correlation if I were to fit different curves instead, such as polynomial regressions.
I would also say that averaging the 2017 and 2020 revenue values and comparing that with the average employee count across 2017 and 2020 could provide some insight. Perhaps \
the data would be more stable than when looking at each value in isolation.
I also want to sort the data by CAGR % as it originally was, and fix the index so that it starts from 1.
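
As a rough illustration of why a logarithmic view might help, the sketch below compares three correlation measures on synthetic power-law data: Pearson on the raw values, Pearson on log-transformed values, and a rank-based (Spearman) correlation computed as Pearson on ranks. The data, the exponent, and the noise level are all made up; nothing here is derived from FT1000.csv.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data for illustration only: y follows a power law in x
# with multiplicative noise, and x spans six orders of magnitude.
x = 10 ** rng.uniform(3, 9, size=500)
y = 5 * x ** 0.5 * np.exp(rng.normal(0, 0.3, size=500))
toy = pd.DataFrame({"x": x, "y": y})

# Pearson correlation on the raw values.
pearson_raw = toy["x"].corr(toy["y"])

# Pearson correlation after taking log10 of both variables.
pearson_log = np.log10(toy["x"]).corr(np.log10(toy["y"]))

# Spearman correlation, computed as Pearson on the ranks.
spearman = toy["x"].rank().corr(toy["y"].rank())

print(f"Pearson (raw):  {pearson_raw:.3f}")
print(f"Pearson (log):  {pearson_log:.3f}")
print(f"Spearman:       {spearman:.3f}")
```

A power law becomes a straight line on log-log axes, so the log-transformed and rank-based measures can pick up a monotonic relationship that a raw linear correlation may understate. Applying the same three measures to the revenue and employee columns would be a cheap first check before trying polynomial fits.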

## Milestone 4
#### Analysis Pipeline

In [None]:
# I decided to clean up my data to start with. A few things were changed.

df_clean = df.copy().drop(['Rank'], axis=1) # Dropping Rank as redundant

df_clean = df_clean[df_clean['Employees2017'] > 0] # Dropping companies that had no employees in 2017

df_clean = df_clean[df_clean['Employees2020'] > 2] # Dropping companies that had fewer than 3 employees in 2020. The threshold of 3 is arbitrary, although I feel that this should be a bare minimum.

#Appropriately setting data types
df_clean['Name'] = df_clean['Name'].astype('str')
df_clean['Ranked2021'] = df_clean['Ranked2021'].astype('category')
df_clean['Ranked2020'] = df_clean['Ranked2020'].astype('category')
df_clean['Country'] = df_clean['Country'].astype('category')
df_clean['Sector'] = df_clean['Sector'].astype('category')

# Renaming columns
column_labels = {'Name':'Name',
             'Ranked2021':'Ranked 2021?',
             'Ranked2020':'Ranked 2020?',
             'Country':'Country',
             'Sector':'Sector',
             'CAGR':'CAGR %',
             'Revenue2020':'2020 Revenue ($)',
             'Revenue2017':'2017 Revenue ($)',
             'Employees2020':'Employees in 2020',
             'Employees2017':'Employees in 2017',
             'FoundingYear':'FoundingYear',
             }
df_clean.rename(columns= column_labels, inplace=True)
df_clean

Data cleaning removed 13 rows of unusable data, and the data is tidier overall.

### Method Chaining.

This process was wrapped into a function and placed in a Python file.

In [None]:
import sys
sys.path.insert(0, '/analysis/code')
from project_functions1 import load_and_process
df = load_and_process("../data/raw/FT1000.csv")
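
The body of `load_and_process` is not shown in this notebook. As a rough sketch only, the cleaning steps from the analysis pipeline could be expressed as a single method chain along the following lines. The function name `load_and_process_sketch` is hypothetical, and the in-memory CSV uses made-up rows with the column names seen earlier so that the example runs without the real file.

```python
import io

import pandas as pd


def load_and_process_sketch(path_or_buffer):
    """Hypothetical method-chained version of the cleaning steps."""
    category_cols = ["Ranked2021", "Ranked2020", "Country", "Sector"]
    column_labels = {
        "Ranked2021": "Ranked 2021?",
        "Ranked2020": "Ranked 2020?",
        "CAGR": "CAGR %",
        "Revenue2020": "2020 Revenue ($)",
        "Revenue2017": "2017 Revenue ($)",
        "Employees2020": "Employees in 2020",
        "Employees2017": "Employees in 2017",
    }
    return (
        pd.read_csv(path_or_buffer)
        .drop(columns=["Rank"])  # Rank is treated as redundant
        .query("Employees2017 > 0 and Employees2020 > 2")  # employee filters
        .astype({"Name": "str", **{col: "category" for col in category_cols}})
        .rename(columns=column_labels)
    )


# Made-up rows for illustration only; not taken from FT1000.csv.
toy_csv = io.StringIO(
    "Rank,Name,Ranked2021,Ranked2020,Country,Sector,CAGR,"
    "Revenue2020,Revenue2017,Employees2020,Employees2017,FoundingYear\n"
    "1,Alpha,No,No,Germany,Technology,150.0,9000000,600000,40,5,2015\n"
    "2,Beta,Yes,No,France,Retail,120.0,5000000,500000,1,0,2016\n"
    "3,Gamma,Yes,Yes,Italy,Energy,90.0,7000000,1000000,25,10,2012\n"
)

clean = load_and_process_sketch(toy_csv)
print(clean)
```

Writing the steps as one chain keeps every transformation visible in order and avoids intermediate variables, which makes the function easier to reuse across notebooks.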