### Presentation of the DataSet scope

This dataset of Tesla car sales was created by combining three different sources:
- A cleaned dataset from [Data.World](https://data.world/gpal/tesla-vehicle-deliveries/workspace/file?filename=tesla_vehicles.csv) presenting the total sales of Tesla cars from 2008 Q3 to 2018 Q1
- A cleaned dataset from [Kaggle](https://www.kaggle.com/eduardopedron/tesla-vehicle-sales-by-quarters) presenting the total sales of all Tesla models from 2015 Q4 to 2019 Q4
- A table scraped from the [Tesla Model 3's Wikipedia page](https://en.wikipedia.org/wiki/Tesla_Model_3#Production) in order to get a DataFrame of all Model 3 sales from 2017 onwards

Once the three datasets are collected, they will be merged into one clean dataset in order to measure the evolution of Tesla car sales, and notably the evolution of Model 3 sales compared to the other models.
Therefore the final dataset will cover Tesla sales from 2008 Q3 to 2019 Q4, with two columns:
- Tesla car sales as a whole
- Tesla Model 3 sales from 2017 onwards


Although Tesla has so far sold 4 different car models (Model S, Model X, Model 3 & the Roadster), the DataFrame only isolates the Model 3 sales, which started in 2017:
- sales figures for the other models are not easily findable, and these models may be considered "niche market" cars because of their price tag and luxury positioning
- by contrast, the Model 3 was meant to be the first mass-market Tesla. Tesla's strategy has always been to sell luxury cars first (Models S & X) in order to build the brand's reputation and prove the viability of the project. Once that goal was fulfilled, the technology was mature and enough cash flow had come in, the second phase of the project was to launch the Model 3.

### Importing Libraries for Data Gathering & Data Cleaning

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from datetime import datetime as dt
import re

%matplotlib inline
import matplotlib.pyplot as plt 

## Importing CSVs of Tesla sales

In [2]:
tesla_car_sales_2008_2018_raw = pd.read_csv(r'D:\IronHack\IronHack_Classes\TSLA_project\DataSets\tesla_car_sales_2008_2018.csv')

In [3]:
tesla_car_sales_2015_2019_raw = pd.read_csv(r'D:\IronHack\IronHack_Classes\TSLA_project\DataSets\Tesla_car_sales_2015_2019.csv')

Cleaning both datasets so that they can be merged later on.

The columns kept are:
- Year
- Quarter
- Amount of sold vehicles

# 1 - Data Wrangling

### 1 - 1. Wrangling of 2008-2018 sales DataFrame:
- Dropping unused columns "Months" & "Total Vehicles" (the latter is the cumulative total of all past sales)
- Renaming column "New Vehicles" to "All Model Sales"
- Keeping only 3 columns: "Year", "Quarter" & "All Model Sales"
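
Dropping "Total Vehicles" loses no information if that column is simply the running total of "New Vehicles". A quick check on the five rows shown by `head()` (values copied by hand; the name `sample` is only illustrative) suggests that this is the case:

```python
import pandas as pd

# First five rows of the 2008-2018 source, copied from its head() output
sample = pd.DataFrame({
    "Year": [2008, 2008, 2008, 2009, 2009],
    "Quarter": [2, 3, 4, 1, 2],
    "Total Vehicles": [3, 30, 100, 320, 500],
    "New Vehicles": [3, 27, 70, 220, 180],
})

# If "Total Vehicles" is cumulative, it equals the running sum of "New Vehicles"
running_total = sample["New Vehicles"].cumsum()
is_cumulative = bool((running_total == sample["Total Vehicles"]).all())
print(is_cumulative)
```

This only covers the rows visible here; running the same comparison on the full file would be needed to confirm it for every quarter.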

In [4]:
tesla_car_sales_2008_2018_raw.head()

Unnamed: 0,Year,Quarter,Months,Total Vehicles,New Vehicles
0,2008,2,Apr-Jun,3,3
1,2008,3,Jul-Sep,30,27
2,2008,4,Oct-Dec,100,70
3,2009,1,Jan-Mar,320,220
4,2009,2,Apr-Jun,500,180


In [5]:
# Dropping unused columns "Months" & "Total Vehicles" (the latter is the cumulative total of all past sales)
tesla_car_sales_2008_2018 = tesla_car_sales_2008_2018_raw.drop(columns=['Months', 'Total Vehicles'])

In [6]:
tesla_car_sales_2008_2018.columns=(['Year', 'Quarter', 'All Model Sales'])

In [7]:
tesla_car_sales_2008_2018

Unnamed: 0,Year,Quarter,All Model Sales
0,2008,2,3
1,2008,3,27
2,2008,4,70
3,2009,1,220
4,2009,2,180
5,2009,3,200
6,2009,4,300
7,2010,1,100
8,2010,2,100
9,2010,3,100


### 1 - 2. Wrangling of 2015-2019 sales DataFrame:
- pd.melt() to turn the "Q1" to "Q4" columns into rows
- renaming columns "variable" to "Quarter" & "value" to "All Model Sales"
- str.replace() to get rid of the "Q" string in order to only keep the quarter number in the "Quarter" column
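
As a minimal sketch of these three steps on two rows copied from the source table, `pd.melt` can name the melted columns directly through `var_name` and `value_name`, which makes the separate rename unnecessary. The names `wide` and `long` are illustrative, and the final cast of "Quarter" to integer is an assumption that differs from the notebook, which keeps it as a string:

```python
import pandas as pd

# Two rows copied from the 2015-2019 source (wide layout: one column per quarter)
wide = pd.DataFrame({
    "Year": [2018, 2019],
    "Q1": [29980, 63000],
    "Q2": [40740, 95200],
    "Q3": [83500, 97000],
    "Q4": [90700, 112000],
})

# var_name / value_name label the melted columns, so no rename step is needed
long = pd.melt(wide, id_vars="Year", var_name="Quarter", value_name="All Model Sales")

# Strip the literal "Q" prefix and store the quarter as an integer
long["Quarter"] = long["Quarter"].str.replace("Q", "", regex=False).astype(int)

# Restore chronological order after the melt
long = long.sort_values(["Year", "Quarter"]).reset_index(drop=True)
print(long)
```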

In [8]:
tesla_car_sales_2015_2019_raw.head()

Unnamed: 0,Year,Q1,Q2,Q3,Q4
0,2015,0,0,0,17400
1,2016,14820,14370,24500,22200
2,2017,25000,22000,26150,29870
3,2018,29980,40740,83500,90700
4,2019,63000,95200,97000,112000


In [9]:
tesla_sales_2015_2019 = pd.melt(tesla_car_sales_2015_2019_raw, id_vars='Year')

In [10]:
tesla_sales_2015_2019.columns=(['Year', 'Quarter', 'All Model Sales'])

In [11]:
tesla_sales_2015_2019['Quarter'] = tesla_sales_2015_2019['Quarter'].str.replace('Q','', regex=True)

Sorting values by Year & Quarter and resetting the index of the DataFrame

In [12]:
tesla_sales_2015_2019.sort_values(by=['Year', 'Quarter'], inplace=True)

In [13]:
tesla_sales_2015_2019.reset_index(drop=True, inplace=True)

In [14]:
tesla_sales_2015_2019

Unnamed: 0,Year,Quarter,All Model Sales
0,2015,1,0
1,2015,2,0
2,2015,3,0
3,2015,4,17400
4,2016,1,14820
5,2016,2,14370
6,2016,3,24500
7,2016,4,22200
8,2017,1,25000
9,2017,2,22000


### 1 - 3. Scraping Model 3 vehicles production

In [15]:
#### scraping Wikipedia table

In [16]:
model_3_sales_raw = pd.read_html('https://en.wikipedia.org/wiki/Tesla_Model_3#Production')[1]

In [17]:
model_3_sales_raw

Unnamed: 0,Quarter,Model 3 vehicles produced
0,2017 Q3[21],260(222 delivered)
1,2017 Q4[22],"2,425(1,542 delivered)"
2,2018 Q1[126],"9,766(8,182 delivered)"
3,2018 Q2[127][128],"28,578(18,449 delivered)"
4,2018 Q3[129][130],"53,239(56,065 delivered)"
5,2018 Q4[131][23],"61,394(63,359 delivered)"
6,2019 Q1[24][132],"62,975(50,928 delivered)"
7,2019 Q2[25][133],"72,548(77,634 delivered)"
8,2019 Q3[26][134],"79,837(79,703 delivered)"
9,2019 Q4[135],"86,958(92,550 delivered)"


Defining functions to clean the datasets of all unused string information

In [18]:
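def quarter_cleaner(string):
    """
    Clean a raw Quarter value by stripping the Wikipedia reference markers.
    
    Input: raw string, e.g. "2018 Q2[127][128]"
    Output: cleaned string, e.g. "2018 Q2"
    """
    
    # Raw string so that \d is read as a regex digit class, not a string escape
    return re.findall(r'\d+ Q\d', string)[0]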
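
A vectorised alternative, sketched here on three sample values copied from the scraped table, is `Series.str.extract` with two capture groups. It would return the year and the quarter number in one pass; the names `raw_quarter` and `parts` are illustrative and not part of the notebook:

```python
import pandas as pd

# Sample values copied from the scraped "Quarter" column
raw_quarter = pd.Series(["2017 Q3[21]", "2018 Q2[127][128]", "2019 Q4[135]"])

# Two capture groups: the 4-digit year and the quarter digit; footnote markers are ignored
parts = raw_quarter.str.extract(r"(\d{4}) Q(\d)").astype(int)
parts.columns = ["Year", "Quarter"]
print(parts)
```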

In [19]:
# def parenthesis_cleaner(string):
#     """
#     Clean the Time column by deleting useless reference quotations
    
#     Input: raw column
#     Output: cleansed
#     """
#     string = string.str.split('(').str[0]
#     string = string.str.replace(',','', regex=True)
    
#     return string

##### Data wrangling to have unified clean data with other Tesla delivery datasets

The dataset imported from Wikipedia distinguishes Model 3 vehicles produced from those delivered.
To simplify, because all Model 3 cars were sold upfront to consumers through online orders years before delivery, the production value will hereafter be considered the sales value.
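
If both figures were ever needed, for instance to see how far production and deliveries diverge, one possible sketch is a single regular expression with two capture groups. The sample strings are copied from the scraped table; the names `raw` and `figures` and the column labels "Produced" and "Delivered" are assumptions made for this illustration:

```python
import pandas as pd

# Sample values copied from the scraped "Model 3 vehicles produced" column
raw = pd.Series(["260(222 delivered)", "2,425(1,542 delivered)", "53,239(56,065 delivered)"])

# First group: production figure; second group: the figure before the word "delivered"
figures = raw.str.extract(r"([\d,]+)\(([\d,]+) delivered\)")
figures.columns = ["Produced", "Delivered"]

# Remove thousands separators, then convert each column to integers
figures = figures.apply(lambda col: col.str.replace(",", "", regex=False).astype(int))
print(figures)
```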

In [20]:
model_3_sales_raw['Quarter'] = model_3_sales_raw['Quarter'].apply(quarter_cleaner)

In [21]:
model_3_sales_raw

Unnamed: 0,Quarter,Model 3 vehicles produced
0,2017 Q3,260(222 delivered)
1,2017 Q4,"2,425(1,542 delivered)"
2,2018 Q1,"9,766(8,182 delivered)"
3,2018 Q2,"28,578(18,449 delivered)"
4,2018 Q3,"53,239(56,065 delivered)"
5,2018 Q4,"61,394(63,359 delivered)"
6,2019 Q1,"62,975(50,928 delivered)"
7,2019 Q2,"72,548(77,634 delivered)"
8,2019 Q3,"79,837(79,703 delivered)"
9,2019 Q4,"86,958(92,550 delivered)"


In [22]:
model_3_sales_raw['Model 3 vehicles produced'] = model_3_sales_raw['Model 3 vehicles produced'].str.split('(').str[0]

In [23]:
model_3_sales_raw['Model 3 vehicles produced'] = model_3_sales_raw['Model 3 vehicles produced'].str.replace(',','', regex=True)

In [24]:
model_3_sales = model_3_sales_raw['Quarter'].str.split(" ", n = 1, expand = True) 

In [25]:
model_3_sales['Model 3 produced'] = model_3_sales_raw['Model 3 vehicles produced']

In [26]:
model_3_sales.columns=(['Year', 'Quarter', 'Model 3 sales'])

In [27]:
model_3_sales['Quarter'] = model_3_sales['Quarter'].str.replace('Q','', regex=True)

In [28]:
# The year is a plain string such as "2017", so a direct integer cast is enough
model_3_sales['Year'] = model_3_sales['Year'].astype(int)

In [29]:
model_3_sales

Unnamed: 0,Year,Quarter,Model 3 sales
0,2017,3,260
1,2017,4,2425
2,2018,1,9766
3,2018,2,28578
4,2018,3,53239
5,2018,4,61394
6,2019,1,62975
7,2019,2,72548
8,2019,3,79837
9,2019,4,86958


### 2 - Merging all 3 DataFrames into one global Tesla car sales DataFrame

In [50]:
tesla_total_sales = tesla_car_sales_2008_2018.copy()

In [51]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
tesla_total_sales = pd.concat([tesla_total_sales, tesla_sales_2015_2019], ignore_index=True)

In [52]:
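# Drop the overlapping quarters by position: row 39 (2018 Q1, first source) and
# rows 40-51 (2015 Q1 to 2017 Q4, second source), assuming the first source has 40 rows
tesla_total_sales.drop(tesla_total_sales.index[39:52], inplace=True)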
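
Dropping by position depends on the exact row count of each source. A sketch of a less fragile variant filters each source by year before concatenating. The toy frames below use placeholder sales values, not real figures, and `older`, `newer` and `cutoff_year` are hypothetical names:

```python
import pandas as pd

# Toy stand-ins for the two sources; the sales values are placeholders
older = pd.DataFrame({"Year": [2017, 2017, 2018], "Quarter": [3, 4, 1], "All Model Sales": [1, 2, 3]})
newer = pd.DataFrame({"Year": [2017, 2018, 2018], "Quarter": [4, 1, 2], "All Model Sales": [20, 30, 40]})

cutoff_year = 2018  # hypothetical: first year taken from the newer source

# Keep the older source before the cutoff and the newer source from the cutoff onwards
combined = pd.concat(
    [older[older["Year"] < cutoff_year], newer[newer["Year"] >= cutoff_year]],
    ignore_index=True,
)
print(combined)
```

With this approach no quarter can appear twice, whatever the length of either source.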

In [53]:
tesla_total_sales.reset_index(drop=True, inplace=True)

In [54]:
tesla_total_sales

Unnamed: 0,Year,Quarter,All Model Sales
0,2008,2,3
1,2008,3,27
2,2008,4,70
3,2009,1,220
4,2009,2,180
5,2009,3,200
6,2009,4,300
7,2010,1,100
8,2010,2,100
9,2010,3,100


In [35]:
# tesla_total_sales['Year'] = pd.to_datetime(tesla_total_sales['Year'])
# tesla_total_sales['Year'] = tesla_total_sales['Year'].dt.year

In [55]:
tesla_total_sales.dtypes

Year                int64
Quarter            object
All Model Sales     int64
dtype: object

In [48]:
model_3_sales

Unnamed: 0,Year,Quarter,Model 3 sales
0,2017,3,260
1,2017,4,2425
2,2018,1,9766
3,2018,2,28578
4,2018,3,53239
5,2018,4,61394
6,2019,1,62975
7,2019,2,72548
8,2019,3,79837
9,2019,4,86958


Merging the All Model DataFrame with Model 3 Sales DataFrame

In [56]:
tesla_total_sales = pd.merge(tesla_total_sales, model_3_sales, how='left', on=['Year', 'Quarter'])

In [57]:
tesla_total_sales.dtypes

Year                int64
Quarter            object
All Model Sales     int64
Model 3 sales      object
dtype: object
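
The `object` dtype of "Quarter" suggests that the column may mix integers (from the 2008-2018 source) and strings (from the 2015-2019 source). If so, a merge on that key would silently miss the rows whose quarter is an integer. The sketch below reproduces the effect on toy frames and shows one possible remedy, casting both keys to `int`; the frames `left` and `right` are illustrative:

```python
import pandas as pd

# Toy frames: the left "Quarter" mixes integers and strings, as an object column can
left = pd.DataFrame({"Year": [2017, 2017, 2018], "Quarter": [3, 4, "1"]})
right = pd.DataFrame({
    "Year": [2017, 2017, 2018],
    "Quarter": ["3", "4", "1"],
    "Model 3 sales": [260, 2425, 9766],
})

# With mixed key types, the integer 3 does not match the string "3"
naive = pd.merge(left, right, how="left", on=["Year", "Quarter"])
matched_naive = int(naive["Model 3 sales"].notna().sum())

# Casting both key columns to int makes every quarter comparable
left["Quarter"] = left["Quarter"].astype(int)
right["Quarter"] = right["Quarter"].astype(int)
aligned = pd.merge(left, right, how="left", on=["Year", "Quarter"])
matched_aligned = int(aligned["Model 3 sales"].notna().sum())

print(matched_naive, matched_aligned)
```

Checking that the 2017 Q3 and Q4 rows of the merged DataFrame carry a Model 3 value would tell whether the notebook is affected.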

#### Data Wrangling 
- dropping duplicate Quarter column
- filling Model 3 sales "NaN" values with 0
- adding a new column "Luxury Models sold", representing the sales of luxury cars only (Model S & Model X), by subtracting "Model 3 sales" from "All Model Sales"

In [59]:
# Assign the result back instead of filling in place on a column selection
tesla_total_sales['Model 3 sales'] = tesla_total_sales['Model 3 sales'].fillna(0)

In [60]:
tesla_total_sales['Model 3 sales'] = tesla_total_sales['Model 3 sales'].astype(int)

In [61]:
tesla_total_sales['Luxury Models sold'] = tesla_total_sales['All Model Sales'] - tesla_total_sales['Model 3 sales']

In [58]:
tesla_total_sales

Unnamed: 0,Year,Quarter,All Model Sales,Model 3 sales
0,2008,2,3,
1,2008,3,27,
2,2008,4,70,
3,2009,1,220,
4,2009,2,180,
5,2009,3,200,
6,2009,4,300,
7,2010,1,100,
8,2010,2,100,
9,2010,3,100,


One outlier has to be corrected manually: for 2019 Q1 the subtraction gives 25 luxury models sold (63,000 - 62,975) instead of the 12,100 reported by Tesla.
cf. [Tesla Q1 2019 report](https://ir.tesla.com/news-releases/news-release-details/tesla-q1-2019-vehicle-production-deliveries)
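
Such an outlier can also be spotted programmatically. The sketch below recomputes the subtraction for 2018 Q1 to 2019 Q2 from the values displayed in the earlier tables and flags any quarter where the derived figure falls by more than 90% against the previous one. The 90% threshold and the names `check`, `drop_ratio` and `suspicious` are arbitrary choices made for this illustration:

```python
import pandas as pd

# 2018 Q1 to 2019 Q2, values copied from the two tables shown earlier
check = pd.DataFrame({
    "Year": [2018, 2018, 2018, 2018, 2019, 2019],
    "Quarter": [1, 2, 3, 4, 1, 2],
    "All Model Sales": [29980, 40740, 83500, 90700, 63000, 95200],
    "Model 3 sales": [9766, 28578, 53239, 61394, 62975, 72548],
})
check["Luxury Models sold"] = check["All Model Sales"] - check["Model 3 sales"]

# Flag quarters where the derived figure falls by more than 90% versus the previous quarter
drop_ratio = check["Luxury Models sold"].pct_change()
suspicious = check[drop_ratio < -0.9]
print(suspicious[["Year", "Quarter", "Luxury Models sold"]])
```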

In [62]:
# Use .loc to assign on the DataFrame itself; chained indexing triggers a SettingWithCopyWarning
tesla_total_sales.loc[43, 'Luxury Models sold'] = 12100


In [63]:
tesla_total_sales.drop(columns = 'All Model Sales', inplace=True)

In [64]:
tesla_total_sales

Unnamed: 0,Year,Quarter,Model 3 sales,Luxury Models sold
0,2008,2,0,3
1,2008,3,0,27
2,2008,4,0,70
3,2009,1,0,220
4,2009,2,0,180
5,2009,3,0,200
6,2009,4,0,300
7,2010,1,0,100
8,2010,2,0,100
9,2010,3,0,100


Exporting to CSV

In [67]:
tesla_total_sales.to_csv('../DataSets/Tesla_car_sales/tesla_total_sales_2008_2019.csv', index=False)