# Week 8 - Concatenate Dataframes (Python) [codingwatchcity]

## Reference 
- http://pandas.pydata.org/pandas-docs/version/0.19.0/generated/pandas.read_csv.html
- http://pandas.pydata.org/pandas-docs/version/0.19/generated/pandas.concat.html
- http://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
- http://pandas.pydata.org/pandas-docs/stable/user_guide/index.html

## Contents
1. Setup
2. Join using `concat` from Pandas

## 1. Setup

In [5]:
%python
def display_pdf(a_pdf):
  display(spark.createDataFrame(a_pdf))

In [6]:
%python
weather_stations = ["KCLT","KCQT","KHOU","KIND","KJAX","KMDW","KNYC","KPHL","KPHX","KSEA"]

In [7]:
%python 
import pandas as pd
khou_pdf = pd.read_csv('/dbfs/mnt/datalab-datasets/538_data/us-weather-history/KHOU.csv',
                       parse_dates=['date']) 
khou_pdf.info()

In [8]:
%python 
khou_pdf.head(3)

In [9]:
%python 
import pandas as pd
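def get_kxxx_dataframe(station_name):
  # Build the path to the station's CSV file, e.g. .../KSEA.csv
  filepath = '/dbfs/mnt/datalab-datasets/538_data/us-weather-history/'+station_name+'.csv'
  # Parse the 'date' column as datetimes and use it as the row index
  return (
    pd.read_csv(filepath,
                header=0,
                parse_dates=['date'])
      .set_index('date')
  )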

In [10]:
%python
get_kxxx_dataframe('KSEA').info()

In [11]:
%python
pd.read_csv('/dbfs/mnt/datalab-datasets/538_data/us-weather-history/KSEA.csv')

In [12]:
%python
get_kxxx_dataframe('KSEA').head(3)

In [13]:
%python
weather_stations

In [14]:
%python
weather_station_df_list = list(map(get_kxxx_dataframe,
                                   weather_stations))

In [15]:
%python
list(map(lambda df: df.head(),
         weather_station_df_list))

In [16]:
%python
list(map(lambda station: (station,get_kxxx_dataframe(station)),
         weather_stations))

__Exercise 1:__ Using the `list` and `map` functions (as above) create a list that contains the first five rows of each dataframe in `weather_station_df_list`.

**Interpretation:**   
1) Code:
- `weather_station_df_list` is a list of dataframes.
- A **lambda** expression defines an anonymous function that returns the first five rows of a dataframe.
- The **map()** function applies that lambda to each element of `weather_station_df_list`.
- The **list()** function turns the resulting map object into a list of dataframes.

2) Output:
- A list of dataframes, each holding the first five rows of the corresponding dataframe in `weather_station_df_list`.

In [19]:
%python
list(map(lambda df: df.head(5),
         weather_station_df_list))

__Exercise 2:__ Using a list comprehension create a list that contains the first five rows of each dataframe in `weather_station_df_list`.

__Interpretation:__  
1) Code:
- The `for` clause of the list comprehension applies the `head` method to each element of `weather_station_df_list`.
- The square brackets collect the results into a list.

2) Output:
- A list of dataframes, each holding the first five rows of the corresponding dataframe in `weather_station_df_list`.

In [22]:
%python
[df.head() for df in weather_station_df_list]
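
The two exercises should give the same result, since `head()` defaults to five rows. A minimal sketch that checks this on toy data follows; `toy_list` and its column `x` are hypothetical stand-ins for `weather_station_df_list`, used so the example runs without the weather files.

```python
import pandas as pd

# Hypothetical stand-in for weather_station_df_list: two small dataframes.
toy_list = [
    pd.DataFrame({'x': range(10)}),
    pd.DataFrame({'x': range(100, 110)}),
]

# Exercise 1 style: map a lambda over the list, then materialize with list().
heads_map = list(map(lambda df: df.head(5), toy_list))

# Exercise 2 style: list comprehension; head() with no argument also returns 5 rows.
heads_comp = [df.head() for df in toy_list]

# Compare the two results element by element.
same = all(a.equals(b) for a, b in zip(heads_map, heads_comp))
print(same)  # True
```

The comprehension is usually considered the more idiomatic of the two in Python, but both build the same list.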

## 2. Join using `concat` from Pandas

In [24]:
%python
import pandas as pd
weather_station_df_concat = \
pd.concat(weather_station_df_list, 
          axis=1, 
          join='inner'
         )
weather_station_df_concat.head()

In [25]:
%python 
weather_station_df_concat.info()

In [26]:
%python 
weather_station_df_concat.columns

In [27]:
%python 
list(weather_station_df_concat.columns)

In [28]:
%python 
sorted(list(weather_station_df_concat.columns))

In [29]:
%python 
pd.Series(data=sorted(list(weather_station_df_concat.columns))).value_counts()


The dataframe `weather_station_df_concat` returned by `concat` has the same problem as the dataframe produced by SQL: duplicate column names. The `concat` function can accommodate this data if we pass it a dictionary of dataframes in which the keys are the weather station names (e.g. 'KSEA', 'KNYC') and the values are the corresponding dataframes.

First take a look at this dictionary:

In [31]:
%python
weather_station_partial_df_dict = dict(map(lambda name: (name.lower(),
                                                         get_kxxx_dataframe(name).head()
                                                        ),
                                           weather_stations))
weather_station_partial_df_dict
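
The difference between passing a list and passing a dictionary to `concat` can be seen on a small toy example. This is only a sketch: the names `toy_a`, `toy_b`, and the column `mean_temp` are hypothetical and are not taken from the weather files.

```python
import pandas as pd

# Two hypothetical station tables sharing the same column name and dates.
dates = pd.to_datetime(['2015-01-01', '2015-01-02'])
toy_a = pd.DataFrame({'mean_temp': [40, 42]}, index=dates)
toy_b = pd.DataFrame({'mean_temp': [55, 57]}, index=dates)

# Passing a list keeps the original column names, so they collide.
from_list = pd.concat([toy_a, toy_b], axis=1, join='inner')
print(list(from_list.columns))
# ['mean_temp', 'mean_temp']

# Passing a dict labels each block of columns with its key,
# giving a two-level column index of (key, original column name).
from_dict = pd.concat({'toy_a': toy_a, 'toy_b': toy_b}, axis=1, join='inner')
print(list(from_dict.columns))
# [('toy_a', 'mean_temp'), ('toy_b', 'mean_temp')]
```

With the dictionary, every column is uniquely identified by the pair of key and column name, which is what removes the duplicates.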

In [32]:
%python
weather_station_df_dict = dict(map(lambda name: (name.lower(),
                                                 get_kxxx_dataframe(name)
                                                ),
                                   weather_stations))

In [33]:
%python
import pandas as pd
weather_station_df_axis_1 = \
pd.concat(weather_station_df_dict, 
          axis=1, 
          join='inner'
         )

In [34]:
%python 
weather_station_df_axis_1 \
  .info()

The dataframe `weather_station_df_axis_1` contains a multi-index on axis 1 (columns). Recall the recent notebook on [w06.1 Indexes](https://bentley.cloud.databricks.com/#notebook/1127461), which contains notes on multi-indexes. 

In addition, notice that this dataframe is a time series: its row index (axis 0) is a datetime index.

In [36]:
%python
weather_station_df_axis_1 \
  .loc['2015-01':'2015-02',[('knyc'),('ksea')]] 
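
The cell above selects by date range and by station. A related selection, picking one measurement across all stations, can be sketched with `DataFrame.xs` on the second column level. The toy frame below is an assumption for illustration: the columns `mean_temp` and `precip` and all values are hypothetical.

```python
import pandas as pd

# Hypothetical data: two stations, two measurements, four consecutive days.
dates = pd.date_range('2015-01-30', periods=4, freq='D')
toy = pd.concat(
    {
        'knyc': pd.DataFrame({'mean_temp': [30, 31, 29, 33],
                              'precip': [0.0, 0.1, 0.0, 0.2]}, index=dates),
        'ksea': pd.DataFrame({'mean_temp': [45, 44, 46, 47],
                              'precip': [0.3, 0.0, 0.5, 0.1]}, index=dates),
    },
    axis=1,
)

# Partial-string indexing on the datetime row index: all rows in February 2015.
feb = toy.loc['2015-02']

# Cross-section on the second column level: one measurement for every station.
mean_temps = toy.xs('mean_temp', axis=1, level=1)
print(list(mean_temps.columns))
# ['knyc', 'ksea']
```

Because `xs` drops the level it selects on, `mean_temps` has ordinary single-level columns named after the stations.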

For more details see [Merge, join, and concatenate](http://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)
from the [pandas User Guide](http://pandas.pydata.org/pandas-docs/stable/user_guide/index.html).

In [38]:
%python
import pandas as pd
weather_station_df_axis_0 = \
pd.concat(weather_station_df_dict, 
          axis=0, 
          join='inner'
         )

weather_station_df_axis_0.head()

In [39]:
%python 
weather_station_df_axis_0 \
  .info()

In [40]:
%python 
weather_station_df_axis_0 \
  .head()
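
With `axis=0` the dictionary keys end up in the row index instead of the columns, so each row is identified by a station and a date. A minimal sketch on toy data follows; `toy_dict`, the column `mean_temp`, and the level names passed to `names` are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical data: two stations observed on the same two dates.
dates = pd.DatetimeIndex(['2015-01-01', '2015-01-02'], name='date')
toy_dict = {
    'knyc': pd.DataFrame({'mean_temp': [30, 31]}, index=dates),
    'ksea': pd.DataFrame({'mean_temp': [45, 44]}, index=dates),
}

# Stack the frames vertically; `names` labels the two levels of the row index.
stacked = pd.concat(toy_dict, axis=0, join='inner', names=['station', 'date'])
print(list(stacked.index.names))
# ['station', 'date']

# Select all rows for one station using the outer index level.
ksea_rows = stacked.loc['ksea']

# Turn both index levels back into ordinary columns (a "long" table).
tidy = stacked.reset_index()
print(list(tidy.columns))
# ['station', 'date', 'mean_temp']
```

The long layout keeps a single copy of each measurement column, which tends to be convenient for grouping by station.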

In [41]:
%python 
weather_station_df_axis_1 \
  .head()

__The End__