# Exploratory Data Analysis

## Forbes’ 18th annual ranking of the world’s 2,000 largest public companies

We will answer some questions, such as:

1. Q01 Which countries/territories have the most companies in this dataset?
1. Q02 Which are the top 10 countries/territories by number of companies in this dataset?

### Basic methods to transform data

In [None]:
import plotly.graph_objects as go
import pandas as pd
df = pd.read_csv('/kaggle/input/forbes-2020-global-2000-largest-public-companies/forbes_top_2000_world_largest_public_companies.csv')

Check the DataFrame's head to see what the data looks like.

In [None]:
df.head()

We can check whether any column has missing values using **df.info()**. In this case there are none, but if there were, **df.isna().sum()** would show how many values are missing in each column.

In [None]:
df.info()
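As a minimal sketch of the **df.isna().sum()** check mentioned above, the cell below builds a small toy DataFrame (hypothetical values, not taken from the Forbes dataset) with a few gaps and counts the missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    'company': ['A', 'B', 'C'],
    'sales': ['$1.5 B', np.nan, '$2,500 M'],
    'profits': [np.nan, np.nan, '$300 M'],
})

# isna() marks every missing cell as True; sum() then counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

A column with a count of zero has no missing values, which is the situation reported for the real dataset.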

Let's start by converting the four columns that hold numeric values: sales, profits, assets and market value.
To do this, we create a map from each magnitude suffix (K, M, B, T) to its multiplier and use it to convert the values to numbers.

In [None]:
m = {'K': 1000, 'M': 1000000, 'B': 1000000000, 'T': 1000000000000}
columns = ['sales', 'profits', 'assets', 'market_value']
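def convert_value(row):
    # Convert each monetary string in the row to a float
    values = []
    for column in columns:
        # Strip thousands separators and the currency sign, then split number from suffix
        value = str(row[column]).replace(',', '').replace('$', '').split()
        # Multiply the number by the factor for its suffix (K, M, B or T)
        values.append(float(value[0]) * m[value[-1]])
    return values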
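To make the parsing step easier to follow, here is a small sketch that applies the same logic to a single string. The helper name `parse_money` and the sample strings are assumptions made for illustration; the sketch assumes the number and its suffix are separated by whitespace, as the `split()` call above implies.

```python
# Multipliers for the magnitude suffixes, mirroring the map used above.
MULTIPLIERS = {'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000, 'T': 1_000_000_000_000}

def parse_money(text):
    """Hypothetical helper: convert a string such as '$1.5 B' to a float."""
    # Drop the thousands separator and the currency sign, then split on whitespace.
    parts = text.replace(',', '').replace('$', '').split()
    number, suffix = parts[0], parts[-1]
    return float(number) * MULTIPLIERS[suffix]

print(parse_money('$1.5 B'))    # 1500000000.0
print(parse_money('$2,500 M'))  # 2500000000.0
```

A value without a recognized suffix raises a `KeyError` here, so it is worth checking the raw strings before converting a whole column.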

Then we fill the columns with the converted values. The dtype of these columns changes from object to float, so we now have numbers to work with.

In [None]:
df['sales'], df['profits'], df['assets'], df['market_value'] = zip(*df.apply(convert_value, axis=1))
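The `zip(*...)` pattern in this cell can be hard to read at first. The sketch below shows it on a toy DataFrame (hypothetical data and a hypothetical `convert_row` function): `apply(..., axis=1)` yields one list per row, and `zip(*...)` transposes those lists into one tuple per column.

```python
import pandas as pd

# Hypothetical toy data with a space between the number and its suffix.
toy = pd.DataFrame({
    'company': ['A', 'B'],
    'sales': ['2 K', '3 M'],
    'profits': ['1 K', '4 K'],
})
scale = {'K': 1_000, 'M': 1_000_000}

def convert_row(row):
    # Return the converted values for one row, in a fixed column order.
    out = []
    for col in ['sales', 'profits']:
        number, suffix = row[col].split()
        out.append(float(number) * scale[suffix])
    return out

per_row = toy.apply(convert_row, axis=1)  # one list of two floats per row
# zip(*per_row) regroups the row-wise lists into one tuple per column.
toy['sales'], toy['profits'] = zip(*per_row)
print(toy.dtypes)
```

The order of the names on the left-hand side has to match the order in which the function appends the values.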

In [None]:
df.info()

### Grouping values to plot graphs

To answer the first question (**Q01**), we group the rows by the **contry/territory** column using the **df.groupby()** method. We want to find out how many companies each country/territory has.

So we use the **count()** method to count the rows for each country/territory.
We then sort the values in descending order, so that the country with the most companies is on top, using **sort_values(by=['company'], ascending=False)**.

In [None]:
temp = df.groupby('contry/territory', as_index=False).count().sort_values(by=['company'], ascending=False)
temp.head()
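One detail worth keeping in mind: **count()** counts the non-null values of every column separately. The sketch below uses a toy DataFrame (hypothetical data, with a simplified `country` column name) to show this, and adds `value_counts()` as a possible shortcut when only the number of rows per group is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one sales value is missing on purpose.
toy = pd.DataFrame({
    'country': ['US', 'US', 'JP', 'US', 'JP', 'BR'],
    'company': ['A', 'B', 'C', 'D', 'E', 'F'],
    'sales': [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
})

# count() gives the number of non-null values per column within each group.
counts = (toy.groupby('country', as_index=False)
             .count()
             .sort_values(by=['company'], ascending=False))
print(counts)

# value_counts() counts rows per country and already sorts in descending order.
per_country = toy['country'].value_counts()
print(per_country)
```

In the toy data the `sales` count for `US` is lower than its `company` count because of the missing value. With no missing values, as in the real dataset, every column carries the same count.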

Let's plot the top 10 countries/territories by number of companies, which answers **Q02**.

In [None]:
fig = go.Figure()

fig.add_trace(go.Bar(x=temp['contry/territory'][:10], y=temp['company'][:10], text=temp['company'][:10], textposition='auto'))

fig.show()

In [None]:
temp = df.groupby('contry/territory', as_index=False).sum().sort_values(by=['market_value'], ascending=False)
temp.head()

In [None]:
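def human_format(num):
    # Divide by 1000 until the number is below 1000, counting the steps
    suffixes = ['', 'K', 'M', 'G', 'T']
    value = 0
    while abs(num) >= 1000 and value < len(suffixes) - 1:
        value += 1
        num /= 1000.0
    # Two decimals plus the suffix that matches the number of divisions
    return '%.2f%s' % (num, suffixes[value])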

In [None]:
fig = go.Figure()

fig.add_trace(go.Bar(x=temp['contry/territory'][:10], y=temp['market_value'][:10], text=temp['market_value'].apply(human_format)[:10], textposition='auto'))

fig.show()

This graph shows that the UK and CA lose their positions when countries are ranked by total market value instead of by number of companies. Other countries have fewer but more valuable companies, so the number of companies matters less in this ranking.
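The difference between the two rankings can be reproduced on a toy example. The data below is hypothetical (made-up countries `X`, `Y`, `Z` and values) and only illustrates why ranking by count and ranking by summed market value can disagree.

```python
import pandas as pd

# Hypothetical toy data: X has many small companies, Y has one very large one.
toy = pd.DataFrame({
    'country': ['X', 'X', 'X', 'Y', 'Z', 'Z'],
    'company': ['a', 'b', 'c', 'd', 'e', 'f'],
    'market_value': [1e9, 2e9, 1e9, 50e9, 3e9, 4e9],
})

# Ranking by number of companies per country.
by_count = toy.groupby('country')['company'].count().sort_values(ascending=False)
# Ranking by total market value per country.
by_value = toy.groupby('country')['market_value'].sum().sort_values(ascending=False)

print(by_count.index.tolist())  # ['X', 'Z', 'Y']
print(by_value.index.tolist())  # ['Y', 'Z', 'X']
```

Country `X` leads by count but comes last by value, which is the same kind of shift seen in the second chart.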