## Pandas Basics

The table has one row for each album and several columns.

- **artist:** Name of the artist
- **album:** Name of the album
- **released_year:** Year the album was released
- **length_min_sec:** Length of the album (hours:minutes:seconds)
- **genre:** Genre of the album
- **music_recording_sales_millions:** Music recording sales (millions in USD) on [SONG://DATABASE]
- **claimed_sales_millions:** Album's claimed sales (millions in USD) on [SONG://DATABASE]
- **date_released:** Date on which the album was released
- **soundtrack:** Indicates whether the album is a movie soundtrack (Y) or not (N)
- **rating_of_friends:** Rating given by your friends, from 1 to 10


variable = feature = column

record = row

dataframe = dataset


### Importing the dataset

In [57]:
import pandas as pd

In [142]:
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', None)  # Disable line wrapping
pd.set_option('display.max_colwidth', None)  # Show full column width

In [145]:
# Set display options
pd.set_option('display.float_format', '{:.6f}'.format)  # To display floats without scientific notation
pd.set_option('display.max_columns', None)  # To display all columns

In [58]:
# read dataset
df = pd.read_csv("TopSellingAlbums.csv")

In [59]:
type(df)

pandas.core.frame.DataFrame

In [60]:
# display the top 5 rows
df.head()

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Michael Jackson,Thriller,1982,0:42:19,"pop, rock, R&B",46.0,65,30-Nov-82,,10.0
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,25-Jul-80,,9.5
2,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,01-Mar-73,,9.0
3,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,17-Nov-92,Y,8.5
4,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,21-Oct-77,,8.0


In [61]:
# display the last 5 rows
df.tail()

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
3,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,17-Nov-92,Y,8.5
4,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,21-Oct-77,,8.0
5,Eagles,Their Greatest Hits (1971-1975),1976,0:43:08,"rock, soft rock, folk rock",32.2,42,17-Feb-76,,7.5
6,Bee Gees,Saturday Night Fever,1977,1:15:54,disco,20.6,40,15-Nov-77,Y,7.0
7,Fleetwood Mac,Rumours,1977,0:40:01,soft rock,27.9,40,04-Feb-77,,6.5


In [62]:
# reading excel file
# df_excel = pd.read_excel("")

- We can select the column "Length" with double square brackets and assign the result, a one-column DataFrame, to `x`.

In [63]:
# 2-D Array
x = df[['Length']]
type(x)

pandas.core.frame.DataFrame

In [64]:
x

Unnamed: 0,Length
0,0:42:19
1,0:42:11
2,0:42:49
3,0:57:44
4,0:46:33
5,0:43:08
6,1:15:54
7,0:40:01


- You can also select a column as a Series by using a single pair of brackets. A pandas Series is a 1-D labelled array, much like one column of a DataFrame.

In [65]:
y = df['Album']
type(y)

pandas.core.series.Series

In [66]:
y

0                           Thriller
1                      Back in Black
2          The Dark Side of the Moon
3                      The Bodyguard
4                    Bat Out of Hell
5    Their Greatest Hits (1971-1975)
6               Saturday Night Fever
7                            Rumours
Name: Album, dtype: object

In [67]:
y.values

array(['Thriller', 'Back in Black', 'The Dark Side of the Moon',
       'The Bodyguard', 'Bat Out of Hell',
       'Their Greatest Hits (1971-1975)', 'Saturday Night Fever',
       'Rumours'], dtype=object)

- you can also convert the above array to list

In [68]:
y.values.tolist()

['Thriller',
 'Back in Black',
 'The Dark Side of the Moon',
 'The Bodyguard',
 'Bat Out of Hell',
 'Their Greatest Hits (1971-1975)',
 'Saturday Night Fever',
 'Rumours']

- selecting multiple columns

In [69]:
# for selecting multiple columns always use double square brackets
y = df[['Length', 'Artist', 'Genre']]
y

Unnamed: 0,Length,Artist,Genre
0,0:42:19,Michael Jackson,"pop, rock, R&B"
1,0:42:11,AC/DC,hard rock
2,0:42:49,Pink Floyd,progressive rock
3,0:57:44,Whitney Houston,"R&B, soul, pop"
4,0:46:33,Meat Loaf,"hard rock, progressive rock"
5,0:43:08,Eagles,"rock, soft rock, folk rock"
6,1:15:54,Bee Gees,disco
7,0:40:01,Fleetwood Mac,soft rock
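The difference between single and double brackets can be sketched on a small stand-in table. The three-row `albums` frame below is a hypothetical sample, not the full `TopSellingAlbums.csv` file.

```python
import pandas as pd

# Hypothetical three-row sample standing in for TopSellingAlbums.csv
albums = pd.DataFrame({
    "Artist": ["Michael Jackson", "AC/DC", "Pink Floyd"],
    "Album": ["Thriller", "Back in Black", "The Dark Side of the Moon"],
    "Released": [1982, 1980, 1973],
})

one_col_series = albums["Album"]            # single brackets -> Series (1-D)
one_col_frame = albums[["Album"]]           # double brackets -> DataFrame (2-D)
two_cols = albums[["Released", "Artist"]]   # columns come back in the order listed

print(type(one_col_series).__name__, one_col_series.shape)
print(type(one_col_frame).__name__, one_col_frame.shape)
print(list(two_cols.columns))
print(one_col_series.tolist())              # Series -> plain Python list
```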


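- To read a single value or a rectangular block of values, use the `iloc` and `loc` indexers.
- Both take `[rows, columns]` and accept single values, slices, and lists, much like indexing and slicing a Python list.
- `iloc` selects by integer position; `loc` selects by label.

**iloc** -> integer location (positions start at 0 and the end of a slice is excluded)

df.iloc[row, column]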

In [70]:
df.iloc[0,2]

1982

In [71]:
df.iloc[3:5, 3:6]

Unnamed: 0,Length,Genre,Music Recording Sales (millions)
3,0:57:44,"R&B, soul, pop",27.4
4,0:46:33,"hard rock, progressive rock",20.6


In [72]:
df.iloc[0, 0:2]

Artist    Michael Jackson
Album            Thriller
Name: 0, dtype: object

In [73]:
df.iloc[6:8, 4:6]

Unnamed: 0,Genre,Music Recording Sales (millions)
6,disco,20.6
7,soft rock,27.9


In [74]:
df.iloc[0:2, 0:6]

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions)
0,Michael Jackson,Thriller,1982,0:42:19,"pop, rock, R&B",46.0
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1


In [75]:
# giving only rows 
df.iloc[0:2]

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Michael Jackson,Thriller,1982,0:42:19,"pop, rock, R&B",46.0,65,30-Nov-82,,10.0
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,25-Jul-80,,9.5


In [76]:
# giving only columns
df.iloc[:,0:3]

Unnamed: 0,Artist,Album,Released
0,Michael Jackson,Thriller,1982
1,AC/DC,Back in Black,1980
2,Pink Floyd,The Dark Side of the Moon,1973
3,Whitney Houston,The Bodyguard,1992
4,Meat Loaf,Bat Out of Hell,1977
5,Eagles,Their Greatest Hits (1971-1975),1976
6,Bee Gees,Saturday Night Fever,1977
7,Fleetwood Mac,Rumours,1977


- we can also provide the rows and columns in list form.

In [77]:
df.iloc[[6,2,5], [5,1]]

Unnamed: 0,Music Recording Sales (millions),Album
6,20.6,Saturday Night Fever
2,24.2,The Dark Side of the Moon
5,32.2,Their Greatest Hits (1971-1975)


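- The other indexer, `loc`, selects by row and column labels rather than integer positions.
- Unlike `iloc`, a `loc` slice includes its end label.

In [78]:
# if the index labels were a, b, c, d, ... you would pass those labels as the row selector
# here the labels are the default 0, 1, 2, ..., so 0:2 returns rows 0, 1 and 2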
df.loc[0:2, 'Artist':'Genre']

Unnamed: 0,Artist,Album,Released,Length,Genre
0,Michael Jackson,Thriller,1982,0:42:19,"pop, rock, R&B"
1,AC/DC,Back in Black,1980,0:42:11,hard rock
2,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock
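The label-versus-position distinction is easier to see when the index is not the default 0, 1, 2. The sketch below assumes a hypothetical frame with string labels `a` to `d`; note how the `loc` slice includes its end label while the `iloc` slice does not.

```python
import pandas as pd

# Hypothetical sample with string row labels instead of 0, 1, 2, ...
albums = pd.DataFrame(
    {
        "Artist": ["Michael Jackson", "AC/DC", "Pink Floyd", "Whitney Houston"],
        "Album": ["Thriller", "Back in Black", "The Dark Side of the Moon", "The Bodyguard"],
        "Released": [1982, 1980, 1973, 1992],
    },
    index=["a", "b", "c", "d"],
)

by_label = albums.loc["a":"c", "Artist":"Released"]   # end label 'c' is included
by_position = albums.iloc[0:2, 0:2]                   # end position 2 is excluded

print(by_label.shape)
print(by_position.shape)
```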


## Adding Column

In [79]:
df['New Artist'] = df['Artist']

In [80]:
df.head()

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating,New Artist
0,Michael Jackson,Thriller,1982,0:42:19,"pop, rock, R&B",46.0,65,30-Nov-82,,10.0,Michael Jackson
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,25-Jul-80,,9.5,AC/DC
2,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,01-Mar-73,,9.0,Pink Floyd
3,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,17-Nov-92,Y,8.5,Whitney Houston
4,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,21-Oct-77,,8.0,Meat Loaf


In [81]:
# adding on specific location
df.insert(2, 'new_col', 0)
df

Unnamed: 0,Artist,Album,new_col,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating,New Artist
0,Michael Jackson,Thriller,0,1982,0:42:19,"pop, rock, R&B",46.0,65,30-Nov-82,,10.0,Michael Jackson
1,AC/DC,Back in Black,0,1980,0:42:11,hard rock,26.1,50,25-Jul-80,,9.5,AC/DC
2,Pink Floyd,The Dark Side of the Moon,0,1973,0:42:49,progressive rock,24.2,45,01-Mar-73,,9.0,Pink Floyd
3,Whitney Houston,The Bodyguard,0,1992,0:57:44,"R&B, soul, pop",27.4,44,17-Nov-92,Y,8.5,Whitney Houston
4,Meat Loaf,Bat Out of Hell,0,1977,0:46:33,"hard rock, progressive rock",20.6,43,21-Oct-77,,8.0,Meat Loaf
5,Eagles,Their Greatest Hits (1971-1975),0,1976,0:43:08,"rock, soft rock, folk rock",32.2,42,17-Feb-76,,7.5,Eagles
6,Bee Gees,Saturday Night Fever,0,1977,1:15:54,disco,20.6,40,15-Nov-77,Y,7.0,Bee Gees
7,Fleetwood Mac,Rumours,0,1977,0:40:01,soft rock,27.9,40,04-Feb-77,,6.5,Fleetwood Mac


In [82]:
df.dtypes

Artist                               object
Album                                object
new_col                               int64
Released                              int64
Length                               object
Genre                                object
Music Recording Sales (millions)    float64
Claimed Sales (millions)              int64
Released.1                           object
Soundtrack                           object
Rating                              float64
New Artist                           object
dtype: object

In [83]:
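# convert Released.1 from string to datetime
# the strings look like 30-Nov-82, so an explicit format would be '%d-%b-%y' (not '%Y%m%d')
# df['Released.1'] = pd.to_datetime(df['Released.1'], format='%d-%b-%y')
# without a format the two-digit years are guessed: 01-Mar-73 appears as 2073-03-01 in later outputs
df['Released.1'] = pd.to_datetime(df['Released.1'])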


In [84]:
df.dtypes

Artist                                      object
Album                                       object
new_col                                      int64
Released                                     int64
Length                                      object
Genre                                       object
Music Recording Sales (millions)           float64
Claimed Sales (millions)                     int64
Released.1                          datetime64[ns]
Soundtrack                                  object
Rating                                     float64
New Artist                                  object
dtype: object
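Later outputs in this notebook show `2073-03-01` for *The Dark Side of the Moon*, which suggests the two-digit year in `01-Mar-73` was guessed wrongly. A minimal sketch of parsing with an explicit format follows. It assumes an English locale for the month abbreviations and relies on pandas mapping `%y` values 69-99 to 1969-1999, as POSIX `strptime` does.

```python
import pandas as pd

# Three date strings in the same style as the Released.1 column
raw_dates = pd.Series(["30-Nov-82", "01-Mar-73", "17-Feb-76"])

# Explicit format: day, abbreviated month name, two-digit year
parsed = pd.to_datetime(raw_dates, format="%d-%b-%y")

print(parsed.dt.year.tolist())
print(parsed.dt.month.tolist())
```

Stating the format also avoids the per-element fallback parsing that pandas warns about when it cannot infer one.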

In [85]:
# extracting months from date time
df['Month'] = df['Released.1'].dt.month
df.head(2)

Unnamed: 0,Artist,Album,new_col,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating,New Artist,Month
0,Michael Jackson,Thriller,0,1982,0:42:19,"pop, rock, R&B",46.0,65,1982-11-30,,10.0,Michael Jackson,11
1,AC/DC,Back in Black,0,1980,0:42:11,hard rock,26.1,50,1980-07-25,,9.5,AC/DC,7


In [86]:
# date time library
import datetime as dt

In [87]:
dt.datetime.weekday

<method 'weekday' of 'datetime.date' objects>
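The cell above only displays the method object. Calling it needs an actual date, and a datetime column offers the same information through the `.dt` accessor. The sketch below uses two release dates from the table; the English day names assume the default locale.

```python
import datetime as dt
import pandas as pd

release = dt.datetime(1982, 11, 30)
print(release.weekday())                 # Monday is 0, Sunday is 6

# Same idea on a whole datetime Series
released = pd.Series(pd.to_datetime(["1982-11-30", "1980-07-25"]))
print(released.dt.weekday.tolist())
print(released.dt.day_name().tolist())
```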

## Dropping Column

In [88]:
df.drop(['New Artist', 'new_col', 'Month'], axis=1, inplace=True)

In [89]:
df.head(1)

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Michael Jackson,Thriller,1982,0:42:19,"pop, rock, R&B",46.0,65,1982-11-30,,10.0


## Data Type of Each Column

In [90]:
df.dtypes

Artist                                      object
Album                                       object
Released                                     int64
Length                                      object
Genre                                       object
Music Recording Sales (millions)           float64
Claimed Sales (millions)                     int64
Released.1                          datetime64[ns]
Soundtrack                                  object
Rating                                     float64
dtype: object

## Null Values Check in Data Frame

In [91]:
df['Released'].value_counts()

Released
1977    3
1982    1
1980    1
1973    1
1992    1
1976    1
Name: count, dtype: int64

In [92]:
df.isnull().sum()

Artist                              0
Album                               0
Released                            0
Length                              0
Genre                               0
Music Recording Sales (millions)    0
Claimed Sales (millions)            0
Released.1                          0
Soundtrack                          6
Rating                              0
dtype: int64
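The six nulls in `Soundtrack` are the albums with no `Y` flag. One possible way to handle them is to fill the gaps with `N`. This is a sketch that assumes a blank really means "not a soundtrack", which the data itself does not confirm; the three-row frame is a hypothetical sample.

```python
import pandas as pd

# Hypothetical sample: None marks a missing Soundtrack flag
albums = pd.DataFrame({
    "Album": ["Thriller", "The Bodyguard", "Rumours"],
    "Soundtrack": [None, "Y", None],
})

missing_before = int(albums["Soundtrack"].isnull().sum())

# Assumption: a missing flag means the album is not a soundtrack
albums["Soundtrack"] = albums["Soundtrack"].fillna("N")
missing_after = int(albums["Soundtrack"].isnull().sum())

print(missing_before, missing_after)
print(albums["Soundtrack"].tolist())
```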

## Summary Statistics

In [93]:
df.describe()

Unnamed: 0,Released,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Rating
count,8.0,8.0,8.0,8,8.0
mean,1979.25,28.125,46.125,1992-04-20 12:00:00,8.25
min,1973.0,20.6,40.0,1976-02-17 00:00:00,6.5
25%,1976.75,23.3,41.5,1977-08-17 06:00:00,7.375
50%,1977.0,26.75,43.5,1979-03-21 12:00:00,8.25
75%,1980.5,28.975,46.25,1985-05-28 00:00:00,9.125
max,1992.0,46.0,65.0,2073-03-01 00:00:00,10.0
std,5.800246,8.189322,8.271077,,1.224745


In [94]:
df.describe(include='all')

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
count,8,8,8.0,8,8,8.0,8.0,8,2,8.0
unique,8,8,,8,8,,,,1,
top,Michael Jackson,Thriller,,0:42:19,"pop, rock, R&B",,,,Y,
freq,1,1,,1,1,,,,2,
mean,,,1979.25,,,28.125,46.125,1992-04-20 12:00:00,,8.25
min,,,1973.0,,,20.6,40.0,1976-02-17 00:00:00,,6.5
25%,,,1976.75,,,23.3,41.5,1977-08-17 06:00:00,,7.375
50%,,,1977.0,,,26.75,43.5,1979-03-21 12:00:00,,8.25
75%,,,1980.5,,,28.975,46.25,1985-05-28 00:00:00,,9.125
max,,,1992.0,,,46.0,65.0,2073-03-01 00:00:00,,10.0


In [95]:
df.describe(include='O')  # O = Object

Unnamed: 0,Artist,Album,Length,Genre,Soundtrack
count,8,8,8,8,2
unique,8,8,8,8,1
top,Michael Jackson,Thriller,0:42:19,"pop, rock, R&B",Y
freq,1,1,1,1,2


## Querying a DataFrame

Selecting the rows that satisfy a condition.

In [96]:
df[df['Rating'] >= 9.0]

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Michael Jackson,Thriller,1982,0:42:19,"pop, rock, R&B",46.0,65,1982-11-30,,10.0
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,1980-07-25,,9.5
2,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,2073-03-01,,9.0


In [97]:
# Show Albums where rating is greater than or equal to 9
df[df['Rating'] >= 9.0][['Album']]

Unnamed: 0,Album
0,Thriller
1,Back in Black
2,The Dark Side of the Moon


In [98]:
df[df['Rating'] >= 9.0][['Album']].count()

Album    3
dtype: int64

In [99]:
df[df['Rating'] >= 9.0][['Album', 'Released', 'Artist']]

Unnamed: 0,Album,Released,Artist
0,Thriller,1982,Michael Jackson
1,Back in Black,1980,AC/DC
2,The Dark Side of the Moon,1973,Pink Floyd


In [100]:
df.head(2)

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Michael Jackson,Thriller,1982,0:42:19,"pop, rock, R&B",46.0,65,1982-11-30,,10.0
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,1980-07-25,,9.5


- Sorting rows by the values of a column (feature)

In [101]:
df = df.sort_values('Released', ascending=True)
df.head()

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
2,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,2073-03-01,,9.0
5,Eagles,Their Greatest Hits (1971-1975),1976,0:43:08,"rock, soft rock, folk rock",32.2,42,1976-02-17,,7.5
4,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,1977-10-21,,8.0
6,Bee Gees,Saturday Night Fever,1977,1:15:54,disco,20.6,40,1977-11-15,Y,7.0
7,Fleetwood Mac,Rumours,1977,0:40:01,soft rock,27.9,40,1977-02-04,,6.5


- In ML workflows the row index matters a lot.
- Sorting reorders the rows, but each row keeps its original index label, so the labels are no longer in order.
- Deleting rows leaves gaps in the index labels.
- Resetting the index renumbers the rows from 0.

In [102]:
# Reset the index
df.reset_index(drop=True, inplace=True)

In [103]:
df

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,2073-03-01,,9.0
1,Eagles,Their Greatest Hits (1971-1975),1976,0:43:08,"rock, soft rock, folk rock",32.2,42,1976-02-17,,7.5
2,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,1977-10-21,,8.0
3,Bee Gees,Saturday Night Fever,1977,1:15:54,disco,20.6,40,1977-11-15,Y,7.0
4,Fleetwood Mac,Rumours,1977,0:40:01,soft rock,27.9,40,1977-02-04,,6.5
5,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,1980-07-25,,9.5
6,Michael Jackson,Thriller,1982,0:42:19,"pop, rock, R&B",46.0,65,1982-11-30,,10.0
7,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,1992-11-17,Y,8.5
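The effect of sorting on the index can be sketched on a hypothetical three-row frame. The index labels travel with their rows until `reset_index(drop=True)` renumbers them; `sort_values(..., ignore_index=True)` does both steps in one call.

```python
import pandas as pd

# Hypothetical sample with the default index 0, 1, 2
albums = pd.DataFrame({
    "Album": ["Thriller", "Back in Black", "The Dark Side of the Moon"],
    "Released": [1982, 1980, 1973],
})

sorted_albums = albums.sort_values("Released")
print(sorted_albums.index.tolist())       # original labels, now out of order

renumbered = sorted_albums.reset_index(drop=True)
print(renumbered.index.tolist())          # labels start from 0 again

# Sort and renumber in a single step
one_step = albums.sort_values("Released", ignore_index=True)
print(one_step.equals(renumbered))
```

Without `drop=True`, `reset_index` would keep the old labels as a new column named `index`.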


In [104]:
# multiple conditions: combine with & (AND) and | (OR), wrapping each condition in parentheses
df[(df['Rating'] >= 9.0) | (df['Soundtrack'] == 'Y')]

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,2073-03-01,,9.0
3,Bee Gees,Saturday Night Fever,1977,1:15:54,disco,20.6,40,1977-11-15,Y,7.0
5,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,1980-07-25,,9.5
6,Michael Jackson,Thriller,1982,0:42:19,"pop, rock, R&B",46.0,65,1982-11-30,,10.0
7,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,1992-11-17,Y,8.5


In [105]:
df[((df['Rating'] >= 9.0) & (df['Released'] == 1973)) | (df['Soundtrack'] == 'Y')]

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,2073-03-01,,9.0
3,Bee Gees,Saturday Night Fever,1977,1:15:54,disco,20.6,40,1977-11-15,Y,7.0
7,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,1992-11-17,Y,8.5
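Long filters are often easier to read when each condition is stored as a named boolean mask first. The sketch below uses a hypothetical five-row sample and combines masks with `&`, `|`, and `~` (NOT). The parentheses in the cells above are needed because `&` and `|` bind more tightly than comparisons such as `>=`.

```python
import pandas as pd

# Hypothetical sample; None marks a missing Soundtrack flag
albums = pd.DataFrame({
    "Album": ["Thriller", "The Bodyguard", "Saturday Night Fever",
              "The Dark Side of the Moon", "Rumours"],
    "Rating": [10.0, 8.5, 7.0, 9.0, 6.5],
    "Soundtrack": [None, "Y", "Y", None, None],
})

high = albums["Rating"] >= 9.0            # boolean Series, one value per row
soundtrack = albums["Soundtrack"] == "Y"  # missing values compare as False

either = albums[high | soundtrack]        # OR
both = albums[high & soundtrack]          # AND
neither = albums[~(high | soundtrack)]    # NOT

print(either["Album"].tolist())
print(both["Album"].tolist())
print(neither["Album"].tolist())
```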


In [106]:
# with loc, the boolean condition selects the rows; the second argument selects the columns
df.loc[df['Rating'] >= 9.0, 'Album':'Genre']

Unnamed: 0,Album,Released,Length,Genre
0,The Dark Side of the Moon,1973,0:42:49,progressive rock
5,Back in Black,1980,0:42:11,hard rock
6,Thriller,1982,0:42:19,"pop, rock, R&B"


- Assigning values with `loc`: each condition selects a set of rows, and the value on the right is written to the 'Rating Group' column for those rows.

In [107]:
df.loc[df['Rating'] >= 9.0, 'Rating Group'] = 'Highest'
df.loc[(df['Rating'] < 9.0) & (df['Rating'] >= 7), 'Rating Group'] = 'Middle'
df.loc[df['Rating'] < 7, 'Rating Group'] = 'Lowest'

In [108]:
df

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating,Rating Group
0,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,2073-03-01,,9.0,Highest
1,Eagles,Their Greatest Hits (1971-1975),1976,0:43:08,"rock, soft rock, folk rock",32.2,42,1976-02-17,,7.5,Middle
2,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,1977-10-21,,8.0,Middle
3,Bee Gees,Saturday Night Fever,1977,1:15:54,disco,20.6,40,1977-11-15,Y,7.0,Middle
4,Fleetwood Mac,Rumours,1977,0:40:01,soft rock,27.9,40,1977-02-04,,6.5,Lowest
5,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,1980-07-25,,9.5,Highest
6,Michael Jackson,Thriller,1982,0:42:19,"pop, rock, R&B",46.0,65,1982-11-30,,10.0,Highest
7,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,1992-11-17,Y,8.5,Middle
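The same three groups can also be produced in one call with `pd.cut`. This is an alternative sketch, not the approach used in the notebook; it assumes left-closed bins so that 7.0 falls in Middle and 9.0 in Highest, matching the `loc` conditions above.

```python
import pandas as pd

# The eight ratings in the order shown in the table above
ratings = pd.Series([9.0, 7.5, 8.0, 7.0, 6.5, 9.5, 10.0, 8.5])

# right=False makes each bin include its left edge: [-inf, 7), [7, 9), [9, inf)
groups = pd.cut(
    ratings,
    bins=[float("-inf"), 7, 9, float("inf")],
    labels=["Lowest", "Middle", "Highest"],
    right=False,
)

print(groups.astype(str).tolist())
```

The result is a categorical Series, which also records the order Lowest < Middle < Highest.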


In [109]:
# albums rated 9.0 or higher (the filter is on Rating, not Soundtrack)
top_rated_albums = df.loc[df['Rating'] >= 9.0, ['Album']]
top_rated_albums

Unnamed: 0,Album
0,The Dark Side of the Moon
5,Back in Black
6,Thriller


- Find the albums released in or after 1980

In [110]:
df[df['Released'] >= 1980]

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating,Rating Group
5,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,1980-07-25,,9.5,Highest
6,Michael Jackson,Thriller,1982,0:42:19,"pop, rock, R&B",46.0,65,1982-11-30,,10.0,Highest
7,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,1992-11-17,Y,8.5,Middle


In [111]:
df[df['Released'] >= 1980][['Album']]

Unnamed: 0,Album
5,Back in Black
6,Thriller
7,The Bodyguard


In [113]:
df_cars = pd.read_csv('cars.csv')
df_cars.head(2)

Unnamed: 0,Make,Model,Variant,Ex-Showroom_Price,Displacement,Cylinders,Valves_Per_Cylinder,Drivetrain,Cylinder_Configuration,Emission_Norm,...,Leather_Wrapped_Steering,Automatic_Headlamps,Engine_Type,ASR_/_Traction_Control,Cruise_Control,USB_Ports,Heads-Up_Display,Welcome_Lights,Battery,Electric_Range
0,Tata,Nano Genx,Xt,"Rs. 2,92,667",624 cc,2.0,2.0,RWD (Rear Wheel Drive),In-line,BS IV,...,,,,,,,,,,
1,Tata,Nano Genx,Xe,"Rs. 2,36,447",624 cc,2.0,2.0,RWD (Rear Wheel Drive),In-line,BS IV,...,,,,,,,,,,


In [114]:
df_cars.describe()

Unnamed: 0,Cylinders,Valves_Per_Cylinder,Doors,Seating_Capacity,Number_of_Airbags,USB_Ports
count,1210.0,1174.0,1272.0,1270.0,1141.0,29.0
mean,4.380992,3.977853,4.550314,5.270079,3.787029,1.793103
std,1.660957,0.833763,0.747816,1.145231,2.522399,0.773642
min,2.0,1.0,2.0,2.0,1.0,1.0
25%,4.0,4.0,4.0,5.0,2.0,1.0
50%,4.0,4.0,5.0,5.0,2.0,2.0
75%,4.0,4.0,5.0,5.0,6.0,2.0
max,16.0,16.0,5.0,16.0,14.0,3.0


In [118]:
df_cars.columns

Index(['Make', 'Model', 'Variant', 'Ex-Showroom_Price', 'Displacement',
       'Cylinders', 'Valves_Per_Cylinder', 'Drivetrain',
       'Cylinder_Configuration', 'Emission_Norm',
       ...
       'Leather_Wrapped_Steering', 'Automatic_Headlamps', 'Engine_Type',
       'ASR_/_Traction_Control', 'Cruise_Control', 'USB_Ports',
       'Heads-Up_Display', 'Welcome_Lights', 'Battery', 'Electric_Range'],
      dtype='object', length=140)

### Grouping

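Grouping works like `GROUP BY` in SQL databases: rows that share a value in the grouping column are collected together, and an aggregate such as the mean is computed for each group. In pandas this is done with the `groupby` method.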

In [136]:
# Set display options
pd.set_option('display.float_format', '{:.6f}'.format)  # To display floats without scientific notation
pd.set_option('display.max_columns', None)  # To display all columns

In [137]:
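# Strip the 'Rs.' prefix and the thousands separators, then convert to float
# (the dot is escaped because regex=True would otherwise treat it as "any character")
df_cars['Ex-Showroom_Price'] = df_cars['Ex-Showroom_Price'].replace({r'Rs\.': '', ',': ''}, regex=True).astype(float)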
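If some price strings might not be clean, `astype(float)` raises an error on the first bad value. A more forgiving sketch uses `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into `NaN`. The `"N/A"` entry below is a hypothetical bad value added for illustration.

```python
import pandas as pd

# Two prices in the dataset's style plus one hypothetical unparseable entry
raw = pd.Series(["Rs. 2,92,667", "Rs. 2,36,447", "N/A"])

stripped = (
    raw.str.replace(r"Rs\.", "", regex=True)   # drop the currency prefix
       .str.replace(",", "", regex=False)      # drop thousands separators
       .str.strip()                            # drop leftover whitespace
)
prices = pd.to_numeric(stripped, errors="coerce")  # bad values become NaN

print(prices.tolist())
```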

In [138]:
df_cars.head(2)

Unnamed: 0,Make,Model,Variant,Ex-Showroom_Price,Displacement,Cylinders,Valves_Per_Cylinder,Drivetrain,Cylinder_Configuration,Emission_Norm,Engine_Location,Fuel_System,Fuel_Tank_Capacity,Fuel_Type,Height,Length,Width,Body_Type,Doors,City_Mileage,Highway_Mileage,ARAI_Certified_Mileage,ARAI_Certified_Mileage_for_CNG,Kerb_Weight,Gears,Ground_Clearance,Front_Brakes,Rear_Brakes,Front_Suspension,Rear_Suspension,Front_Track,Rear_Track,Front_Tyre_&_Rim,Rear_Tyre_&_Rim,Power_Steering,Power_Windows,Power_Seats,Keyless_Entry,Power,Torque,Odometer,Speedometer,Tachometer,Tripmeter,Seating_Capacity,Seats_Material,Type,Wheelbase,Wheels_Size,Start_/_Stop_Button,12v_Power_Outlet,Audiosystem,Aux-in_Compatibility,Average_Fuel_Consumption,Basic_Warranty,Bluetooth,Boot-lid_Opener,Boot_Space,CD_/_MP3_/_DVD_Player,Central_Locking,Child_Safety_Locks,Clock,Cup_Holders,Distance_to_Empty,Door_Pockets,Engine_Malfunction_Light,Extended_Warranty,FM_Radio,Fuel-lid_Opener,Fuel_Gauge,Handbrake,Instrument_Console,Low_Fuel_Warning,Minimum_Turning_Radius,Multifunction_Display,Sun_Visor,Third_Row_AC_Vents,Ventilation_System,Auto-Dimming_Rear-View_Mirror,Hill_Assist,Gear_Indicator,3_Point_Seat-Belt_in_Middle_Rear_Seat,Ambient_Lightning,Cargo/Boot_Lights,Drive_Modes,Engine_Immobilizer,High_Speed_Alert_System,Lane_Watch_Camera/_Side_Mirror_Camera,Passenger_Side_Seat-Belt_Reminder,Seat_Back_Pockets,Voice_Recognition,Walk_Away_Auto_Car_Lock,ABS_(Anti-lock_Braking_System),Headlight_Reminder,Adjustable_Headrests,Gross_Vehicle_Weight,Airbags,Door_Ajar_Warning,EBD_(Electronic_Brake-force_Distribution),Fasten_Seat_Belt_Warning,Gear_Shift_Reminder,Number_of_Airbags,Compression_Ratio,Adjustable_Steering_Column,Other_Specs,Other_specs,Parking_Assistance,Key_Off_Reminder,USB_Compatibility,Android_Auto,Apple_CarPlay,Cigarette_Lighter,Infotainment_Screen,Multifunction_Steering_Wheel,Average_Speed,EBA_(Electronic_Brake_Assist),Seat_Height_Adjustment,Navigation_System,Second_Row_AC_Vents,Tyre_Pressure_Monitoring_System,Rear_Center_Armrest,iPod_Compatibility,ESP_(Electronic_Stability_Program),Cooled_Glove_Box,Recommended_Tyre_Pressure,Heated_Seats,Turbocharger,ISOFIX_(Child-Seat_Mount),Rain_Sensing_Wipers,Paddle_Shifters,Leather_Wrapped_Steering,Automatic_Headlamps,Engine_Type,ASR_/_Traction_Control,Cruise_Control,USB_Ports,Heads-Up_Display,Welcome_Lights,Battery,Electric_Range
0,Tata,Nano Genx,Xt,292667.0,624 cc,2.0,2.0,RWD (Rear Wheel Drive),In-line,BS IV,"Rear, Transverse",Injection,24 litres,Petrol,1652 mm,3164 mm,1750 mm,Hatchback,5.0,?23.6 km/litre,,23.6 km/litre,,660 kg,4,180 mm,Drum,Drum,"Independent, Lower Wishbone, McPherson Strut w...","Independent, Semi Trailing arm with coil sprin...",1325 mm,1315 mm,135/70R12,155/65R12,Electric Power,Only Front Windows,,Remote,38PS@5500rpm,51Nm@4000rpm,Digital,Analog,Not on offer,Yes,4.0,Fabric,Manual,2230 mm,4 B X 12,Yes,Yes,CD Player with USB & Aux-in,Yes,Yes,2 years /75000 Kms (years/distance whichever c...,Yes,Internal,110 litres,Yes,Yes,Yes,Digital,Front,Yes,Front,Yes,2 years /150000 Kms (years/distance whichever ...,Yes,Internal,Digital,Manual,Analog + Digital,Yes,4 meter,Yes,Driver & Front Passenger,Not Applicable,Manual Air conditioning with cooling and heating,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,Tata,Nano Genx,Xe,236447.0,624 cc,2.0,2.0,RWD (Rear Wheel Drive),In-line,BS IV,"Rear, Transverse",Injection,24 litres,Petrol,1652 mm,3164 mm,1750 mm,Hatchback,5.0,?23.6 km/litre,,23.6 km/litre,,725 kg,4,180 mm,Drum,Drum,"Independent, Lower Wishbone, McPherson Strut w...","Independent, Semi Trailing arm with coil sprin...",1325 mm,1315 mm,135/70R12,155/65R12,,,,,38PS@5500rpm,51Nm@4000rpm,Digital,Analog,Not on offer,Yes,4.0,Fabric,Manual,2230 mm,4 B X 12,,Yes,Not on offer,,Yes,2 years /75000 Kms (years/distance whichever c...,,Internal,110 litres,,,Yes,Digital,Front,Yes,Front,Yes,2 years /150000 Kms (years/distance whichever ...,,Internal,Digital,Manual,Analog + Digital,Yes,4 meter,Yes,Driver & Front Passenger,Not Applicable,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [139]:
# average car price for each Drivetrain
df_cars.groupby('Drivetrain')['Ex-Showroom_Price'].mean()

Drivetrain
4WD                       12165326.389831
AWD (All Wheel Drive)     14695693.307190
FWD (Front Wheel Drive)    1213478.047351
RWD (Rear Wheel Drive)    10264286.252941
Name: Ex-Showroom_Price, dtype: float64
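What `groupby(...).mean()` does is easier to verify by hand on a tiny table. The four-row `cars` frame below is entirely hypothetical (the makes and prices are invented for illustration); it groups by one key, then by two keys with `reset_index()` to get a flat DataFrame back.

```python
import pandas as pd

# Hypothetical data, small enough to check the means by hand
cars = pd.DataFrame({
    "Make": ["A", "A", "B", "B"],
    "Drivetrain": ["FWD", "RWD", "FWD", "FWD"],
    "Price": [100.0, 500.0, 200.0, 300.0],
})

# One key: a Series indexed by Drivetrain
by_drive = cars.groupby("Drivetrain")["Price"].mean()
print(by_drive)

# Two keys: reset_index() turns the group labels back into ordinary columns
by_both = cars.groupby(["Make", "Drivetrain"])["Price"].mean().reset_index()
print(by_both)
```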

In [140]:
df_cars.groupby('Variant')['Ex-Showroom_Price'].mean()

Variant
1.0 S                        390000.000000
1.0 S Amt                    437065.000000
1.0 Turbo Gdi Dct S          940000.000000
1.0 Turbo Gdi Dct Sx Plus   1115500.000000
1.0 Turbo Gdi Mt S           826000.000000
                                 ...      
Zxi Amt                      633867.500000
Zxi Amt (O)                  552350.000000
Zxi At                       896056.500000
Zxi Plus                     850204.333333
Zxi Plus Amt                 835306.500000
Name: Ex-Showroom_Price, Length: 1064, dtype: float64

In [143]:
df_cars.groupby(['Make', 'Drivetrain'])['Ex-Showroom_Price'].mean()

In [148]:
# shows in dataframe form
group = df_cars.groupby(['Make', 'Variant', 'Drivetrain'])['Ex-Showroom_Price'].mean().reset_index()

In [None]:
group.sort_values(['Ex-Showroom_Price'], ascending=False)

In [150]:
df_cars.dtypes

Make                                          object
Model                                         object
Variant                                       object
Ex-Showroom_Price                            float64
Displacement                                  object
Cylinders                                    float64
Valves_Per_Cylinder                          float64
Drivetrain                                    object
Cylinder_Configuration                        object
Emission_Norm                                 object
Engine_Location                               object
Fuel_System                                   object
Fuel_Tank_Capacity                            object
Fuel_Type                                     object
Height                                        object
Length                                        object
Width                                         object
Body_Type                                     object
Doors                                        float64
...

### Named Aggregation

In [None]:
df_cars.groupby(['Make', 'Drivetrain'])['Ex-Showroom_Price'].mean().reset_index()
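The cell above computes a plain mean. Named aggregation proper passes `new_column=(source_column, function)` pairs to `agg`, so several statistics can be computed at once with readable column names. A minimal sketch on hypothetical data follows; the names `avg_price`, `max_price`, and `n_cars` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical data, small enough to check by hand
cars = pd.DataFrame({
    "Make": ["A", "A", "B", "B"],
    "Drivetrain": ["FWD", "RWD", "FWD", "FWD"],
    "Price": [100.0, 500.0, 200.0, 300.0],
})

# Each keyword becomes an output column: name=(input column, aggregation)
summary = (
    cars.groupby("Drivetrain")
        .agg(
            avg_price=("Price", "mean"),
            max_price=("Price", "max"),
            n_cars=("Price", "count"),
        )
        .reset_index()
)
print(summary)
```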