<h1 style="text-align: center; font-size: 2rem; color: #38565c; font-weight: bold">
    Welcome to part 3 of your pandas training!
</h1>

<p style="text-align: left; font-size: 1rem; color: #38565c;">
    You have learned how to get a pandas DataFrame into your Python environment by creating it, reading it from a file, or loading it from a database. Awesome!
</p>
<p style="text-align: left; font-size: 1rem; color: #38565c;">
    You have also learned how to select relevant data from DataFrame and Series objects, and how to manipulate the DataFrame by assigning new values to it. Being able to select and manipulate data is critical knowledge if you want to start using Python and pandas.
</p>


<p style="text-align: left; font-size: 1rem; color: #38565c;">
    Before you start any manipulation, you want a <b style="color: #000;">better understanding of your data</b>, because in most situations the data does not arrive in the format your task needs. We will therefore now learn how to create a quick overview of your data and how to apply specific logic to manipulate your DataFrame.
</p>


<p style="text-align: left; font-size: 1.25rem; color: #38565c;">
   Import pandas and load data to a dataframe
</p>

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("../input/videogamesales/vgsales.csv")

<p style="text-align: left; font-size: 1.25rem; color: #38565c;">
    I always start by looking at the data using the <code>.head()</code> function.
</p>
<p style="text-align: left; font-size: 1rem; color: #38565c;">
    This is a nice way to learn about your data by displaying the column names and some of the values.
</p>
<b style="color: #000"> What can you say about the data so far? </b>

In [None]:
df.head(10)

<p style="text-align: left; font-size: 1.25rem; color: #38565c;">
    The next action is to use the <code>.info()</code> function.
</p>

<p style="text-align: left; font-size: 1rem; color: #38565c;">
    It shows the number of rows, the data type of each column, and how many non-null values each column holds.
</p>
<b style="color: #000"> Do you see anything that might be of interest? </b>

In [None]:
df.info()

<p style="text-align: left; font-size: 1.25rem; color: #38565c;">
    To learn more about the actual data, I then use the <code>.describe()</code> function.
</p>

<p style="text-align: left; font-size: 1rem; color: #38565c;">
    For every numeric column, this function reports the count, mean, standard deviation, minimum, quartiles, and maximum.
</p>
<b style="color: #000"> What conclusions can you draw from this output?</b>

In [None]:
df.describe()

<p style="text-align: left; font-size: 1.25rem; color: #38565c;">
    It's also possible to compute summary statistics on a specific column using functions such as <code>.mean()</code>, <code>.sum()</code>, and <code>.std()</code>, to name a few.
</p>

<b style="color: #000"> What column would be suitable to run the <code>.sum()</code> function on?</b>



In [None]:
df["Global_Sales"].sum()

<b style="color: #000"> How would you interpret the below statements?</b>

In [None]:
df[df["Rank"] <= 100]["Global_Sales"].mean()

In [None]:
df[df["Rank"] > 100]["Global_Sales"].mean()  # strictly greater, so rank 100 is not counted in both groups

<p style="text-align: left; font-size: 1.25rem; color: #38565c;">
    With the <code>.value_counts()</code> function, we can learn how often each value occurs in a column, and therefore how the data is distributed across those values.
</p>


In [None]:
df["Year"].value_counts()
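<p style="text-align: left; font-size: 1rem; color: #38565c;">
    As a small sketch of what <code>.value_counts()</code> offers beyond plain counts, the cell below uses a made-up Series standing in for <code>df["Year"]</code> (the values are invented for illustration). It shows the default behaviour, relative frequencies with <code>normalize=True</code>, and how <code>dropna=False</code> makes missing values visible.
</p>

```python
import pandas as pd

# Toy data, invented for illustration; it stands in for df["Year"]
years = pd.Series([2006, 2006, 2008, None, 2008, 2006])

# Default: absolute counts, most frequent value first, missing values ignored
counts = years.value_counts()

# Relative frequencies instead of absolute counts
shares = years.value_counts(normalize=True)

# Also count the missing values
with_missing = years.value_counts(dropna=False)

print(counts)
print(shares)
print(with_missing)
```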

<p style="text-align: left; font-size: 1.25rem; color: #38565c;">
    The <code>.unique()</code> and <code>.nunique()</code> functions give us insight into the uniqueness of the values within a column: the first returns the distinct values, the second how many there are.
</p>


In [None]:
df["Publisher"].unique()

In [None]:
df["Publisher"].nunique()

<p style="text-align: left; font-size: 1.5rem; color: #38565c;">
    That's enough with insights. Let's start manipulating the data!
</p>

<p style="font-size: 1.25rem; color: #38565c;">To apply a function to each value in a series you can use the <code>.map()</code> function</p>

<p style="font-size: 1.25rem; color: #38565c;">Let's see how we can use it on the <b>Platform</b> column</p>

In [None]:
def sort_platform_provider(platform):
    """Map a platform code to the company behind it."""
    # Every list entry needs its own comma: two adjacent string literals
    # such as "SNES" "WiiU" are silently joined into "SNESWiiU".
    if platform in ["PSP", "PS2", "PS3", "PS", "PS4", "PSV"]:
        return "Sony Playstation"
    elif platform in ["DS", "Wii", "GBA", "3DS", "N64", "SNES", "WiiU", "NES", "GC"]:
        return "Nintendo"
    elif platform in ["XB", "XOne", "X360"]:
        return "Microsoft XBOX"
    elif platform in ["PC"]:
        return "PC"
    else:
        return "Other"

In [None]:
# .map() calls the function once for every value in the Series
df["Platform Provider"] = df["Platform"].map(sort_platform_provider)

In [None]:
df.head()
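<p style="text-align: left; font-size: 1rem; color: #38565c;">
    <code>.map()</code> also accepts a dictionary instead of a function. The sketch below is an alternative to the function-based approach, not a replacement for it: the toy Series and the shortened lookup table <code>provider_by_platform</code> are made up for illustration. Values that are missing from the dictionary become <code>NaN</code>, so <code>.fillna()</code> plays the role of the <code>else</code> branch.
</p>

```python
import pandas as pd

# Toy platform codes, invented for illustration
platforms = pd.Series(["PS4", "Wii", "X360", "PC", "2600"])

# Hypothetical, shortened lookup table (the full mapping would list every platform)
provider_by_platform = {
    "PS4": "Sony Playstation",
    "Wii": "Nintendo",
    "X360": "Microsoft XBOX",
    "PC": "PC",
}

# Unmapped values become NaN, so fillna() acts as the "else" branch
providers = platforms.map(provider_by_platform).fillna("Other")
print(providers)
```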

<p style="font-size: 1.25rem; color: #38565c;">To apply a function to each column (or row) in a dataframe you can use the <code>.apply()</code> function</p>
<p style="font-size: 1.25rem; color: #38565c;">Let's say that we want to know where each game has sold the most copies</p>

In [None]:
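# With axis=1 each row is passed to the function as a Series; head(3) keeps the printout short
df.head(3).apply(lambda row: print(row), axis=1)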

In [None]:
def most_copies_sold_in(row):
    col = row.idxmax()
    if col == "NA_Sales":
        return "North America"
    elif col == "EU_Sales":
        return "Europe"
    elif col == "JP_Sales":
        return "Japan"
    else:
        return "Other"

In [None]:
df["Top Sales Region"] = df[["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]].apply(lambda row: most_copies_sold_in(row), axis=1)

In [None]:
df.head(20)
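<p style="text-align: left; font-size: 1rem; color: #38565c;">
    Row-wise <code>.apply()</code> is flexible but calls a Python function once per row. For this particular task, a vectorised alternative is to call <code>.idxmax(axis=1)</code> on the whole DataFrame and translate the resulting column names with <code>.map()</code>. The sketch below uses a small made-up sales table and a hypothetical <code>region_names</code> dictionary; it is one possible approach, assuming the same four sales columns as above.
</p>

```python
import pandas as pd

# Toy sales figures, invented for illustration
sales = pd.DataFrame({
    "NA_Sales": [41.49, 3.0, 0.1],
    "EU_Sales": [29.02, 5.0, 0.2],
    "JP_Sales": [3.77, 1.0, 0.9],
    "Other_Sales": [8.46, 0.5, 0.0],
})

# Hypothetical lookup from column name to readable region name
region_names = {
    "NA_Sales": "North America",
    "EU_Sales": "Europe",
    "JP_Sales": "Japan",
    "Other_Sales": "Other",
}

# idxmax(axis=1) returns, for each row, the name of the column with the largest value
top_region = sales.idxmax(axis=1).map(region_names)
print(top_region)
```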

<p style="font-size: 1.25rem; color: #38565c;">Can we draw any new insights with these two newly created columns?</p>

In [None]:
df["Platform Provider"].value_counts()

In [None]:
df.groupby(["Platform Provider"])["Top Sales Region"].value_counts()

<h1 style="text-align: center; font-size: 2rem; color: #38565c; font-weight: bold">
    <a href="https://www.kaggle.com/kernels/fork/595524">Now it's time for exercises!</a>
</h1>

<hr style="border-top: 3px solid #bbb;">

<h1 style="text-align: center; font-size: 2rem; color: #38565c; font-weight: bold">
    Welcome to the next part of your pandas training!
</h1>

<p style="text-align: left; font-size: 1rem; color: #38565c;">
    You will soon have all the tools required to start working with your data using pandas!
</p>
<p style="text-align: left; font-size: 1rem; color: #38565c;">
    Remember that the community is very big. If you ever run into a problem that you can't solve with the documentation alone, you will almost always find the answer on Stack Overflow.
</p>
<p style="text-align: left; font-size: 1rem; color: #38565c;">
     The agenda for today:
    <ul>
        <li>Review of exercises</li> 
        <li>Grouping and Sorting</li> 
        <li>Data types and Missing data</li>
        <li>Renaming and Combining data</li>
    </ul>
</p>



<h1 style="text-align: center; font-size: 2rem; color: #38565c; font-weight: bold">
    <a href="https://www.kaggle.com/kernels/fork/595524">Exercises Review</a>
</h1>

<h1 style="font-size: 2rem; color: #38565c; font-weight: bold">
    Grouping and Sorting
</h1>

<h2 style="font-size: 1.5rem; color: #38565c; font-weight: bold">
    Grouping
</h2>
<p style="text-align: left; font-size: 1rem; color: #38565c;">
    The group by function is applied to a DataFrame and involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups. It's similar to the SQL GROUP BY operation.
    <ul>
        <li style="font-size: 1rem; color: #38565c;"><b>Splitting</b> the data into groups based on some criteria</li>
        <li style="font-size: 1rem; color: #38565c;"><b>Applying</b> a function to each group independently</li>
        <li style="font-size: 1rem; color: #38565c;"><b>Combining</b> the results into a data structure</li>
    </ul>
</p>
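<p style="text-align: left; font-size: 1rem; color: #38565c;">
    To make the three steps concrete, the sketch below performs them by hand on a small made-up table and then compares the result with the one-line <code>groupby()</code> call. The DataFrame <code>games</code> and its values are invented for illustration.
</p>

```python
import pandas as pd

# Toy data, invented for illustration
games = pd.DataFrame({
    "Genre": ["Sports", "Racing", "Sports", "Racing", "Puzzle"],
    "Global_Sales": [82.74, 35.82, 33.00, 23.42, 30.26],
})

# Split: one sub-DataFrame per genre
pieces = {genre: group for genre, group in games.groupby("Genre")}

# Apply: compute the total sales of each piece independently
totals = {genre: group["Global_Sales"].sum() for genre, group in pieces.items()}

# Combine: collect the per-group results into one Series
manual = pd.Series(totals, name="Global_Sales").sort_index()

# The same three steps in a single expression
automatic = games.groupby("Genre")["Global_Sales"].sum()

print(manual)
print(automatic)
```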

<b style="color: #000">Can you recall a function that we've been using recently that does this? </b>

In [None]:
df.head()

<p style="font-size: 1.25rem; color: #38565c;">
    <code>groupby()</code> creates one group per distinct value of the grouping column, for example one group of games per top sales region. For each of these groups, we
    can then grab a column and count how many entries it holds. <code>value_counts()</code> is just a shortcut for this kind of <code>groupby()</code> operation.
</p>
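<p style="text-align: left; font-size: 1rem; color: #38565c;">
    The equivalence can be checked on a small example. The Series below is made up for illustration; grouping it by its own values and counting gives the same numbers as <code>value_counts()</code>, only in a different order (<code>groupby()</code> sorts by group label, <code>value_counts()</code> by frequency).
</p>

```python
import pandas as pd

# Toy data, invented for illustration
regions = pd.Series(
    ["North America", "Europe", "North America", "Japan", "North America"],
    name="Top Sales Region",
)

# Group the Series by its own values and count the members of each group
via_groupby = regions.groupby(regions).count()

# The shortcut
via_value_counts = regions.value_counts()

print(via_groupby)
print(via_value_counts)
```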

<p style="font-size: 1.25rem; color: #38565c;">To count the number of games per genre and platform provider, we can use the following expression</p>


In [None]:
df.groupby(["Genre"])["Platform Provider"].value_counts()

<p style="font-size: 1.25rem; color: #38565c;">The output of the <code>.groupby()</code> function is a DataFrameGroupBy object. To display the results of the operation, we need to apply an aggregation function, just as we do in SQL.</p>
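<p style="text-align: left; font-size: 1rem; color: #38565c;">
    A minimal sketch of this behaviour, using a made-up <code>games</code> table: the grouped object itself holds no result until an aggregation is requested, and <code>.agg()</code> can compute several aggregations in one call.
</p>

```python
import pandas as pd

# Toy data, invented for illustration
games = pd.DataFrame({
    "Genre": ["Sports", "Racing", "Sports", "Racing", "Puzzle"],
    "Global_Sales": [82.74, 35.82, 33.00, 23.42, 30.26],
})

# No computation has been displayed yet; this is only a description of the groups
grouped = games.groupby("Genre")
print(type(grouped).__name__)

# Applying aggregation functions produces an ordinary DataFrame
summary = grouped["Global_Sales"].agg(["count", "sum", "mean"])
print(summary)
```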

<p style="font-size: 1.25rem; color: #38565c;">We can extend the selection of columns to group by, which narrows the scope of the aggregating function. Of course, we can also change the order of the columns for our analysis</p>

In [None]:
df.groupby(["Year", "Platform Provider", "Genre"])["Global_Sales"].sum()

<p style="font-size: 1.25rem; color: #38565c;">We can also run the <code>.apply()</code> function on a groupby object.</p>
<p style="font-size: 1.25rem; color: #38565c;">This operation will return a DataFrame with the best seller for each combination of year and genre</p>

In [None]:
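# Drop the regional sales columns; only the global figures are needed here
best_seller = df.drop(["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"], axis=1)
# For each (Year, Genre) group, keep the row with the highest Global_Sales
best_seller = best_seller.groupby(['Year', 'Genre']).apply(lambda group: group.loc[group["Global_Sales"].idxmax()].drop(["Year", "Genre"]))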
best_seller.tail(30)


<h2 style="font-size: 1.5rem; color: #38565c; font-weight: bold">
    Sorting
</h2>
<p style="font-size: 1.25rem; color: #38565c;">To re-arrange our dataframe and sort it according to the value we choose, we can use the <code>.sort_values()</code> function.</p>

In [None]:
? best_seller.sort_values

<p style="font-size: 1.25rem; color: #38565c;">As you can see, the grouping is lost. To preserve the grouping we need to add some parameters to our <code>.sort_values()</code> function</p>

In [None]:
best_seller = best_seller.sort_values(["Year", "Genre", "Global_Sales"], ascending=[False, False, True])
best_seller.head(30)
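<p style="text-align: left; font-size: 1rem; color: #38565c;">
    The effect of sorting by one column versus several can be seen on a small made-up table. Sorting by sales alone interleaves the years, while passing a list of columns (with a matching list for <code>ascending</code>) keeps rows of the same year together. The <code>games</code> DataFrame below is invented for illustration.
</p>

```python
import pandas as pd

# Toy data, invented for illustration
games = pd.DataFrame({
    "Year": [2008, 2006, 2008, 2006],
    "Genre": ["Racing", "Sports", "Sports", "Racing"],
    "Global_Sales": [35.82, 82.74, 22.00, 3.10],
})

# Sorting by a single column: rows of the same year are no longer adjacent
by_sales = games.sort_values("Global_Sales", ascending=False)

# Sorting by several columns: year first, then sales (highest first) within each year
by_year_then_sales = games.sort_values(["Year", "Global_Sales"], ascending=[True, False])

print(by_sales)
print(by_year_then_sales)
```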

<p style="font-size: 1.25rem; color: #38565c;">When we use the <code>.groupby()</code> function, the columns that we group by become the index of the new DataFrame. If we group by more than one column, we get a DataFrame with a multi-index. Dealing with multi-label indices can be a little bit tricky. Therefore, it's common to reset the index to a single-label index with <code>.reset_index()</code></p>

In [None]:
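# Move the Year and Genre index levels back into ordinary columns
best_seller = best_seller.reset_index()
best_seller.head(20)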
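<p style="text-align: left; font-size: 1rem; color: #38565c;">
    What resetting the index does is easiest to see on a small example. The sketch below uses a made-up <code>games</code> table: after grouping by two columns the result carries a two-level index, and <code>.reset_index()</code> turns those levels back into regular columns.
</p>

```python
import pandas as pd

# Toy data, invented for illustration
games = pd.DataFrame({
    "Year": [2006, 2006, 2008, 2008],
    "Genre": ["Sports", "Racing", "Sports", "Racing"],
    "Global_Sales": [82.74, 3.10, 22.00, 35.82],
})

# Grouping by two columns gives a result with a two-level (multi) index
grouped = games.groupby(["Year", "Genre"])["Global_Sales"].sum()
print(grouped.index.nlevels)

# reset_index() moves the index levels back into columns and restores a plain 0..n-1 index
flat = grouped.reset_index()
print(flat)
```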

<h1 style="font-size: 2rem; color: #38565c; font-weight: bold">
    Renaming and Combining
</h1>

<h2 style="font-size: 1.5rem; color: #38565c; font-weight: bold">
    Renaming
</h2>
<p style="text-align: left; font-size: 1rem; color: #38565c;">
Oftentimes data will come to us with column names, index names, or other naming conventions that we are not satisfied with. In this section you'll learn how easy it is to rename them to a preferable format.</p>


In [None]:
best_seller_swedish_cols = ["År", "Kategori", "Ranking", "Namn", "Spelplattform", "Utgivare", "Global Försäljning", "Spelplattformsleverantör", "Bästa Försäljningsregion"]
best_seller.columns = best_seller_swedish_cols

In [None]:
best_seller.head()

In [None]:
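# Map each Swedish column name back to an English one.
# "Ranking" is not listed, so rename() leaves it unchanged.
col_dict = {
    "Spelplattform": "Platform",
    "Utgivare": "Publisher",
    "Spelplattformsleverantör": "Platform_Provider",
    "År": "Year",
    "Kategori": "Genre",
    "Namn": "Name",
    "Global Försäljning": "Global_Sales",
    "Bästa Försäljningsregion": "Top_Sales_Region",
}
# Apply the mapping; rename() returns a new DataFrame unless inplace=True is passed
best_seller = best_seller.rename(columns=col_dict)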

<p style="font-size: 1.25rem; color: #38565c;">Do you notice any differences between the two methods?</p>
<p style="font-size: 1.25rem; color: #38565c;">When would it make sense to use the <code>.rename()</code> function?</p>

In [None]:
best_seller.rename(columns={"Global_Sales": "Sales ($M)"}, inplace=True)

In [None]:
best_seller.head(10)
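<p style="text-align: left; font-size: 1rem; color: #38565c;">
    The two renaming approaches can be compared side by side on a tiny made-up DataFrame. Assigning to <code>.columns</code> requires a complete list in the right order, whereas <code>.rename()</code> only touches the columns named in the dictionary and, by default, returns a new DataFrame. The one-row table below is invented for illustration.
</p>

```python
import pandas as pd

# Toy data, invented for illustration
original = pd.DataFrame({"År": [2006], "Namn": ["Wii Sports"], "Utgivare": ["Nintendo"]})

# Method 1: assign a full list; every column must be listed, in positional order
via_list = original.copy()
via_list.columns = ["Year", "Name", "Publisher"]

# Method 2: rename with a dictionary; only the listed columns change, order does not matter
via_dict = original.rename(columns={"Namn": "Name", "År": "Year"})

print(via_list.columns.tolist())
print(via_dict.columns.tolist())
print(original.columns.tolist())  # unchanged, because rename() returned a new DataFrame
```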

<h2 style="font-size: 1.5rem; color: #38565c; font-weight: bold">
    Combining
</h2>
<p style="text-align: left; font-size: 1rem; color: #38565c;">
    When doing data analysis, it's quite common for information to be spread across multiple data sources. In database design it is considered bad practice to duplicate data or to reserve space for null values, which is one reason the star schema is popular. For this reason, when performing operations on a dataset, we will sometimes need to combine different DataFrames and/or Series in non-trivial ways.
</p>    
<p style="text-align: left; font-size: 1rem; color: #38565c;">    
    Pandas has three core methods for doing
    this. In order of increasing complexity, these are <code>concat()</code>, <code>join()</code>, and <code>merge()</code>. <code>join()</code> is a convenient choice when the DataFrames share the same index, while <code>merge()</code> gives you more flexibility, for example the possibility to join on column values. Most of what <code>join()</code> does can therefore also be done with <code>merge()</code>.
    The simplest combining method is <code>concat()</code>. Given a list of elements, this function will smush those elements together along an axis.
</p>
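<p style="text-align: left; font-size: 1rem; color: #38565c;">
    A minimal side-by-side sketch of the three methods follows. All DataFrames, names, and values here are made up for illustration: <code>concat()</code> stacks rows, <code>join()</code> aligns on the index, and <code>merge()</code> aligns on column values.
</p>

```python
import pandas as pd

# Toy data, invented for illustration
sales_a = pd.DataFrame({"Name": ["Game A", "Game B"], "Sales": [1.0, 2.0]})
sales_b = pd.DataFrame({"Name": ["Game C"], "Sales": [3.0]})

# concat(): stack the rows of both frames; ignore_index gives a fresh 0..n-1 index
stacked = pd.concat([sales_a, sales_b], ignore_index=True)

# join(): combine frames that share the same index (here the game name)
publishers = pd.DataFrame({"Publisher": ["P1", "P2"]}, index=["Game A", "Game B"])
joined = sales_a.set_index("Name").join(publishers)

# merge(): combine on column values, even when the key columns have different names
playtime = pd.DataFrame({"Game_Name": ["Game A", "Game B"], "Hours": [10.0, 20.0]})
merged = sales_a.merge(playtime, left_on="Name", right_on="Game_Name")

print(stacked)
print(joined)
print(merged)
```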

<p style="font-size: 1.25rem; color: #38565c;">We have received more data for video games in 2019 - the best selling game by genre</p>

In [None]:
best_seller_2019 = pd.DataFrame({
    "Genre": ["Battle Royale", "Adventure", "Sports", "Strategy"],
    "Ranking": [None, None, None, None],
    "Name": ["Fortnite", "Animal Crossing", "FIFA2020", "Tom Clancy's The Division 2"],
    "Platform": ["PC", "WiiU", "PS4", "PS4"],
    "Publisher": ["Epic Games", "Nintendo", "Electronic Arts", "Ubisoft"],
    "Sales ($M)": ["58.54", "46.24", "51.32", "32.35"],
    "Platform_Provider": ["PC", "Nintendon", "Sony Playstation", "Sony Playstation"],
    "Top_Sales_Region": ["North America", "Japan", "Europe", "North America"]
    
})

In [None]:
best_seller_2019

<p style="font-size: 1.25rem; color: #38565c;">What do we need to do before we can concatenate this new data to our <code>best_seller</code> DataFrame?</p>
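<p style="text-align: left; font-size: 1rem; color: #38565c;">
    One possible set of preparation steps is sketched below on made-up stand-ins for the two tables (the frames <code>existing</code> and <code>new_rows</code> and their values are invented, and the real data may need further cleaning): add the missing <code>Year</code> column, convert the sales figures from strings to floats, and align the column order before calling <code>concat()</code>.
</p>

```python
import pandas as pd

# Toy stand-ins, invented for illustration
existing = pd.DataFrame({
    "Year": [2016],
    "Genre": ["Sports"],
    "Name": ["Game A"],
    "Sales ($M)": [10.0],
})
new_rows = pd.DataFrame({
    "Genre": ["Sports"],
    "Name": ["Game B"],
    "Sales ($M)": ["51.32"],  # stored as text, like in the new 2019 data
})

# 1. Add the column that is missing from the new data
prepared = new_rows.assign(Year=2019)

# 2. Convert the sales figures from strings to floats
prepared["Sales ($M)"] = prepared["Sales ($M)"].astype(float)

# 3. Put the columns in the same order as the existing frame
prepared = prepared[existing.columns]

combined = pd.concat([existing, prepared], ignore_index=True)
print(combined)
print(combined.dtypes)
```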

<p style="font-size: 1.25rem; color: #38565c;">Even more data has now been made available. We have received information on <code>Average_Playtime</code> for the games in our <code>best_seller</code> DataFrame</p>

In [None]:
import numpy as np
avg_playtime_df = pd.DataFrame({"Game_Name": best_seller["Name"].values,
                               "Average_Playtime (hours)": np.around(np.random.exponential(scale=40, size=len(best_seller["Name"])), 2)})

In [None]:
avg_playtime_df.head()

<p style="font-size: 1.25rem; color: #38565c;">To merge the newly acquired information with our <code>best_seller</code> DataFrame, we use the <code>.merge()</code> function</p>

In [None]:
? pd.merge

In [None]:
best_seller.head()
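<p style="text-align: left; font-size: 1rem; color: #38565c;">
    As a sketch of how such a merge could look, the cell below uses two small made-up tables in place of <code>best_seller</code> and <code>avg_playtime_df</code>. Because the key columns have different names, <code>left_on</code> and <code>right_on</code> are used; <code>how="left"</code> keeps every game from the left table, and <code>indicator=True</code> adds a <code>_merge</code> column that shows which rows found a match.
</p>

```python
import pandas as pd

# Toy stand-ins, invented for illustration
best = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Sales ($M)": [10.0, 7.5, 3.2],
})
playtime = pd.DataFrame({
    "Game_Name": ["Game A", "Game B"],
    "Average_Playtime (hours)": [12.5, 40.0],
})

merged = (
    best.merge(
        playtime,
        how="left",            # keep every row of the left table
        left_on="Name",        # key column in the left table
        right_on="Game_Name",  # key column in the right table
        indicator=True,        # adds a "_merge" column describing each row's origin
    )
    .drop(columns="Game_Name")  # the duplicate key column is no longer needed
)
print(merged)
```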