<h1>Introduction to Pandas in Python</h1>

<p><strong>Welcome!</strong> In this notebook you will practice using <code>Pandas</code> in the Python programming language. By the end of this lab, you'll know how to use the <code>Pandas</code> package to view and access data.</p>

<h2>Table of Contents</h2>
<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ul>
        <li><a href="#dataset">About the Dataset</a></li>
        <li><a href="#pandas">Introduction of <code>Pandas</code></a></li>
        <li><a href="#data">Viewing Data and Accessing Data</a></li>
        <li><a href="#quiz">Quiz on DataFrame</a></li>
    </ul>

</div>

<hr>

<h2 id="dataset">About the Dataset</h2>

The file TopSellingAlbums contains a table with one row for each album and the following columns:

<ul>
    <li><b>artist</b>: Name of the artist</li>
    <li><b>album</b>: Name of the album</li>
    <li><b>released_year</b>: Year the album was released</li>
    <li><b>length_min_sec</b>: Length of the album (hours,minutes,seconds)</li>
    <li><b>genre</b>: Genre of the album</li>
    <li><b>music_recording_sales_millions</b>: Music recording sales (millions in USD) on <a href="http://www.song-database.com/">[SONG://DATABASE]</a></li>
    <li><b>claimed_sales_millions</b>: Album's claimed sales (millions in USD) on <a href="http://www.song-database.com/">[SONG://DATABASE]</a></li>
    <li><b>date_released</b>: Date on which the album was released</li>
    <li><b>soundtrack</b>: Indicates whether the album is a movie soundtrack (Y) or not (N)</li>
    <li><b>rating_of_friends</b>: Indicates the rating from your friends from 1 to 10</li>
</ul>



<hr>

<h2 id="pandas">Introduction of <code>Pandas</code></h2>

In [1]:
# Import required library

import pandas as pd

After the import command, we have access to a large number of pre-built classes and functions. This assumes the library is installed; in our lab environment all the necessary libraries are installed. One way Pandas lets you work with data is through a dataframe. Let's go through the process of going from a comma-separated values (<b>.csv</b>) file to a dataframe. The path of the <b>.csv</b> file is passed as an argument to the <code>read_csv</code> function. The result is stored in the object <code>df</code>, a common short name for a variable referring to a Pandas dataframe.

In [2]:
# Read data from the CSV file TopSellingAlbums.csv

df = pd.read_csv('TopSellingAlbums.csv')

We can use the method <code>head()</code> to examine the first five rows of a dataframe: 

In [3]:
# Print first five rows of the dataframe

df.head(5)

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Michael Jackson,Thriller,1982,0:42:19,"pop, rock, R&B",46.0,65,30-Nov-82,,10.0
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,25-Jul-80,,9.5
2,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,01-Mar-73,,9.0
3,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,17-Nov-92,Y,8.5
4,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,21-Oct-77,,8.0
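The same steps can be tried without the file on disk. The following is a minimal sketch, assuming a few rows copied from the output above and held in an in-memory string; the names <code>csv_text</code> and <code>sample_df</code> are illustrative and not part of the lab.

```python
import io

import pandas as pd

# A few rows copied from the output above (illustrative subset of columns),
# kept in memory so the example does not depend on TopSellingAlbums.csv.
csv_text = """Artist,Album,Released,Length,Genre
Michael Jackson,Thriller,1982,0:42:19,"pop, rock, R&B"
AC/DC,Back in Black,1980,0:42:11,hard rock
Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)    # (number of rows, number of columns)
print(sample_df.dtypes)   # the type pandas inferred for each column
print(sample_df.head(2))  # head(n) returns the first n rows
```

Note that the quoted genre <code>"pop, rock, R&amp;B"</code> is read as a single value even though it contains commas.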


To read an Excel file, we pass its path to the function <code>read_excel</code>. The result is a dataframe as before:

In [4]:
# Read data from the Excel file TopSellingAlbums.xlsx
df = pd.read_excel('TopSellingAlbums.xlsx')

We can access the column <b>Length</b> and assign it to a new dataframe <b>x</b>:

In [8]:
# Access the column Length: first as a series, then as a dataframe x
print(df['Length'])
x = df[['Length']]
print(x)

0    00:42:19
1    00:42:11
2    00:42:49
3    00:57:44
4    00:46:33
5    00:43:08
6    01:15:54
7    00:40:01
Name: Length, dtype: object
     Length
0  00:42:19
1  00:42:11
2  00:42:49
3  00:57:44
4  00:46:33
5  00:43:08
6  01:15:54
7  00:40:01
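The difference between the two printouts above comes from the type of the result. As a small sketch, assuming a three-row dataframe built by hand from values shown earlier (the name <code>albums</code> is illustrative), the types and shapes can be checked directly:

```python
import pandas as pd

# Hand-built stand-in for df, using values from the outputs above
albums = pd.DataFrame({
    "Artist": ["Michael Jackson", "AC/DC", "Pink Floyd"],
    "Length": ["00:42:19", "00:42:11", "00:42:49"],
})

as_series = albums["Length"]    # single brackets: one column name
as_frame = albums[["Length"]]   # double brackets: a list of column names

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The series is one-dimensional, while the dataframe keeps two dimensions even though it holds a single column.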


<hr>

<h2 id="data">Viewing Data and Accessing Data</h2>

You can also get a column as a series. You can think of a Pandas series as a 1-D dataframe. Just use a single pair of brackets:

In [9]:
# Get the column "Length" as a series

x = df['Length']
print(x)

0    00:42:19
1    00:42:11
2    00:42:49
3    00:57:44
4    00:46:33
5    00:43:08
6    01:15:54
7    00:40:01
Name: Length, dtype: object


You can also get a column as a dataframe by using double brackets. For example, we can assign the column <b>Artist</b> to a new dataframe:

In [10]:
# Get the column "Artist" as a dataframe
print(df['Artist'])
dfartist = df[['Artist']]
print(dfartist)

0    Michael Jackson
1              AC/DC
2         Pink Floyd
3    Whitney Houston
4          Meat Loaf
5             Eagles
6           Bee Gees
7      Fleetwood Mac
Name: Artist, dtype: object
            Artist
0  Michael Jackson
1            AC/DC
2       Pink Floyd
3  Whitney Houston
4        Meat Loaf
5           Eagles
6         Bee Gees
7    Fleetwood Mac


You can do the same for multiple columns; the result is a new dataframe made up of the specified columns:

In [31]:
# Access multiple columns: Length, Artist and Genre
print(df[['Length','Artist', 'Genre']])



     Length           Artist                        Genre
0  00:42:19  Michael Jackson               pop, rock, R&B
1  00:42:11            AC/DC                    hard rock
2  00:42:49       Pink Floyd             progressive rock
3  00:57:44  Whitney Houston               R&B, soul, pop
4  00:46:33        Meat Loaf  hard rock, progressive rock
5  00:43:08           Eagles   rock, soft rock, folk rock
6  01:15:54         Bee Gees                        disco
7  00:40:01    Fleetwood Mac                    soft rock
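Two details of multi-column selection are worth trying out: the columns come back in the order given in the list, and asking for a name that does not exist raises a <code>KeyError</code>. The sketch below assumes a small hand-built dataframe (<code>albums</code> is an illustrative name) that deliberately has no <b>Rating</b> column.

```python
import pandas as pd

# Hand-built stand-in for df; it has no "Rating" column on purpose
albums = pd.DataFrame({
    "Artist": ["Michael Jackson", "AC/DC", "Pink Floyd"],
    "Length": ["00:42:19", "00:42:11", "00:42:49"],
    "Genre": ["pop, rock, R&B", "hard rock", "progressive rock"],
})

# Columns appear in the order requested, not the order in the source
subset = albums[["Length", "Artist"]]
print(list(subset.columns))  # ['Length', 'Artist']

# Requesting a column that is not in the dataframe raises KeyError
try:
    albums[["Length", "Rating"]]
    missing_column_error = None
except KeyError as err:
    missing_column_error = err
    print("KeyError:", err)
```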


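You can also access individual elements (check the slides for more info). With <code>iloc</code>, rows and columns are addressed by integer position, starting from 0. Passing the column position as a list, as below, returns a one-element series; <code>df.iloc[0, 0]</code> returns the bare value. You can access the 1st row and the 1st column as follows: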

In [39]:
# Access the value on the first row and the first column
df.iloc[0,[0]]

Artist    Michael Jackson
Name: 0, dtype: object

You can access the 2nd row and the 1st column as follows:

In [38]:
# Access the value on the second row and the first column
df.iloc[1,[0]]

Artist    AC/DC
Name: 1, dtype: object

You can access the 1st row and the 3rd column as follows: 

In [41]:
# Access the value on the first row and the third column
df.iloc[0,[2]]

Released    1982
Name: 0, dtype: object
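The outputs above are one-element series because the column position is wrapped in a list. As a minimal sketch on a hand-built dataframe (<code>albums</code> is an illustrative name), a plain integer in the column slot gives the value itself:

```python
import pandas as pd

# Hand-built stand-in for df, using values from the outputs above
albums = pd.DataFrame({
    "Artist": ["Michael Jackson", "AC/DC", "Pink Floyd"],
    "Album": ["Thriller", "Back in Black", "The Dark Side of the Moon"],
    "Released": [1982, 1980, 1973],
})

scalar = albums.iloc[0, 0]      # integer column position: the bare value
one_item = albums.iloc[0, [0]]  # list of positions: a one-element series

print(scalar)
print(type(one_item).__name__, list(one_item.index))
```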

You can also access a column by name using <code>loc</code>. The following cells return the same results as above, but specify the name of the column rather than its position:

In [15]:
# Access the 1st row and the column named Artist
df.loc[0,['Artist']]



Artist    Michael Jackson
Name: 0, dtype: object

In [16]:
# Access the 2nd row and the column named Artist
df.loc[1,['Artist']]

Artist    AC/DC
Name: 1, dtype: object

In [44]:
# Access the 1st row and the column named Released
df.loc[0, ['Released']]


Released    1982
Name: 0, dtype: object
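In this dataset the row labels happen to be the integers 0 to 7, so <code>loc</code> and <code>iloc</code> take the same row numbers. The sketch below uses a hand-built dataframe with made-up string row labels (<code>"a"</code>, <code>"b"</code>, <code>"c"</code> are assumptions for illustration, not part of the lab data) to show that <code>loc</code> works with labels and <code>iloc</code> with positions.

```python
import pandas as pd

# Hand-built stand-in for df with hypothetical string row labels
albums = pd.DataFrame(
    {
        "Artist": ["Michael Jackson", "AC/DC", "Pink Floyd"],
        "Released": [1982, 1980, 1973],
    },
    index=["a", "b", "c"],
)

by_label = albums.loc["b", "Artist"]  # row label "b", column name "Artist"
by_position = albums.iloc[1, 0]       # second row, first column

print(by_label, by_position)
```

Both expressions point at the same cell; only the way of addressing it differs.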




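You can perform slicing by position with <code>iloc</code> or by label with <code>loc</code>. Note that an <code>iloc</code> slice excludes its end position, while a <code>loc</code> slice includes its end label: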

In [63]:
# Slice the dataframe by position: first 2 rows and first 3 columns
df.iloc[0:2, 0:3]


Unnamed: 0,Artist,Album,Released
0,Michael Jackson,Thriller,1982
1,AC/DC,Back in Black,1980


In [46]:
# Slice the dataframe by label: first 3 rows (labels 0 to 2, inclusive)
# and the columns Artist, Album and Released, specified by name
df.loc[0:2, ['Artist', 'Album', 'Released']]



Unnamed: 0,Artist,Album,Released
0,Michael Jackson,Thriller,1982
1,AC/DC,Back in Black,1980
2,Pink Floyd,The Dark Side of the Moon,1973
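The two cells above use the same row slice <code>0:2</code> yet return two rows and three rows respectively. As a small sketch on a hand-built four-row dataframe (<code>albums</code> is an illustrative name), the difference in how the end of the slice is treated can be checked directly:

```python
import pandas as pd

# Hand-built stand-in for df with the default integer row labels 0..3
albums = pd.DataFrame({
    "Artist": ["Michael Jackson", "AC/DC", "Pink Floyd", "Whitney Houston"],
    "Album": ["Thriller", "Back in Black",
              "The Dark Side of the Moon", "The Bodyguard"],
    "Released": [1982, 1980, 1973, 1992],
})

# Position-based slice: the end position 2 is excluded (rows 0 and 1)
by_position = albums.iloc[0:2, 0:3]

# Label-based slice: the end labels are included (rows 0, 1 and 2,
# columns from "Artist" through "Released")
by_label = albums.loc[0:2, "Artist":"Released"]

print(by_position.shape)
print(by_label.shape)
```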


<hr>

<h2 id="quiz">Quiz on DataFrame</h2>

Use a variable <code>q</code> to store the column <b>Rating</b> as a dataframe:

In [20]:
# Write your code below and press Shift+Enter to execute

q = df[['Rating']]
print(q)

   Rating
0    10.0
1     9.5
2     9.0
3     8.5
4     8.0
5     7.5
6     7.0
7     6.5


Assign to the variable <code>q</code> the dataframe made up of the columns <b>Released</b> and <b>Artist</b>:

In [21]:
# Write your code below and press Shift+Enter to execute
q = df[['Released', 'Artist']]
print(q)

   Released           Artist
0      1982  Michael Jackson
1      1980            AC/DC
2      1973       Pink Floyd
3      1992  Whitney Houston
4      1977        Meat Loaf
5      1976           Eagles
6      1977         Bee Gees
7      1977    Fleetwood Mac


Access the 2nd row and the 3rd column of <code>df</code>:

In [30]:
# Write your code below and press Shift+Enter to execute
df.iloc[1, [2]]

Released    1980
Name: 1, dtype: object


<h2>The last exercise!</h2>


<h3>About the Authors:</h3>  
<p><a href="https://www.linkedin.com/in/joseph-s-50398b136/" target="_blank">Joseph Santarcangelo</a> is a Data Scientist at IBM, and holds a PhD in Electrical Engineering. His research focused on using Machine Learning, Signal Processing, and Computer Vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.</p>

Other contributors: <a href="https://www.linkedin.com/in/jiahui-mavis-zhou-a4537814a">Mavis Zhou</a>

<hr>

<p>Copyright &copy; 2018 IBM Developer Skills Network. This notebook and its source code are released under the terms of the <a href="https://cognitiveclass.ai/mit-license/">MIT License</a>.</p>