# Pandas Introduction

Pandas is an open source data processing library written in C, Cython and Python. It provides a number of functions and methods that make working with data easier: reading in CSV files, connecting to a SQL database and applying user-defined functions (UDFs) to data can all be done in Pandas.

Below we will import the Pandas module and specify the path of our file.

In [4]:
import pandas as pd

PATH = r'/home/tom/Documents/csv_files/pokemon.csv' #absolute path to our local CSV file

Once we have imported the library and specified the path of our file, in this case a CSV file containing information about different Pokemon, we can create a Pandas DataFrame, which lets us interact with the data in an easy-to-read row and column format. Note that we are using Pandas' .read_csv() function to read in our CSV file. Delimited TXT files are also read with .read_csv(), using the sep argument to name the delimiter when it is not a comma.
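
As a rough sketch of reading a TXT file, the example below parses tab-separated text with .read_csv(). The data and the names raw_text and txt_df are made up for illustration, and an in-memory buffer stands in for a file on disk so the snippet runs anywhere.

```python
import io

import pandas as pd

# Hypothetical tab-separated content standing in for a .txt file on disk.
raw_text = "Name\tType 1\tHP\nBulbasaur\tGrass\t45\nCharmander\tFire\t39\n"

# sep tells read_csv which delimiter to split on (the default is a comma).
txt_df = pd.read_csv(io.StringIO(raw_text), sep="\t")

print(txt_df.shape)           # (rows, columns)
print(list(txt_df.columns))   # column names taken from the first line
```

With a real file, the path would be passed in place of the buffer, as is done with PATH in this notebook.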

We can take a small sample of our data using the .head() method, which returns the first n rows of our data (5 by default) for display.

In [12]:
df = pd.read_csv(PATH)
df.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
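
To make the n argument concrete, here is a small sketch on a made-up DataFrame (demo is a hypothetical name, not part of the Pokemon data). It also shows .tail(), the counterpart that returns the last rows.

```python
import pandas as pd

# A small hypothetical frame; only the number of rows matters here.
demo = pd.DataFrame({"Name": list("abcdefgh"), "HP": range(8)})

default_rows = demo.head()   # n defaults to 5
first_three = demo.head(3)   # first 3 rows only
last_two = demo.tail(2)      # last 2 rows

print(len(default_rows), len(first_three), len(last_two))
```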


Now that we have loaded our CSV file into a DataFrame, we can begin to analyze and transform the data. We can summarize our dataset using the .describe() method, which by default returns the count, mean, standard deviation, minimum, quartiles and maximum of each numeric column within our DataFrame.

In [3]:
df.describe()

Unnamed: 0,#,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation
count,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0
mean,362.81375,435.1025,69.25875,79.00125,73.8425,72.82,71.9025,68.2775,3.32375
std,208.343798,119.96304,25.534669,32.457366,31.183501,32.722294,27.828916,29.060474,1.66129
min,1.0,180.0,1.0,5.0,5.0,10.0,20.0,5.0,1.0
25%,184.75,330.0,50.0,55.0,50.0,49.75,50.0,45.0,2.0
50%,364.5,450.0,65.0,75.0,70.0,65.0,70.0,65.0,3.0
75%,539.25,515.0,80.0,100.0,90.0,95.0,90.0,90.0,5.0
max,721.0,780.0,255.0,190.0,230.0,194.0,230.0,180.0,6.0
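
Text columns such as Name do not appear in the output above because .describe() only covers numeric columns by default. As a sketch on a made-up three-row sample (sample is a hypothetical name), passing include="all" asks for every column, which adds rows such as unique, top and freq for the non-numeric ones.

```python
import pandas as pd

# Hypothetical miniature of the Pokemon data.
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Charmander"],
    "Type 1": ["Grass", "Grass", "Fire"],
    "HP": [45, 60, 39],
})

numeric_summary = sample.describe()           # numeric columns only
all_summary = sample.describe(include="all")  # every column

print(list(numeric_summary.columns))
print(all_summary.loc[["count", "unique", "top"], "Type 1"])
```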


To view the schema of the dataset in more detail and look at the data types each column contains, we can use the .dtypes attribute which returns a Pandas Series of each column along with its corresponding data type.

Note that the object data type is shown for the columns that contain string values. It is a common mistake to think that the object data type refers only to strings. It is a catch-all for arbitrary Python objects, so a column will generally be assigned it when its values do not fit a dedicated type such as floats, integers or Booleans - which means any JSON/array values or mixed-type values will be counted as objects.

In [8]:
df.dtypes 

#              int64
Name          object
Type 1        object
Type 2        object
Total          int64
HP             int64
Attack         int64
Defense        int64
Sp. Atk        int64
Sp. Def        int64
Speed          int64
Generation     int64
Legendary       bool
dtype: object
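
As a quick illustration of the point about the object data type, the sketch below builds three made-up Series (the names ints, lists and mixed are hypothetical) and prints the data type Pandas infers for each.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])          # homogeneous integers
lists = pd.Series([[1, 2], [3]])     # Python lists, e.g. parsed JSON arrays
mixed = pd.Series([1, "two", 3.0])   # values of different Python types

# Only the homogeneous integer Series gets a numeric data type.
for name, series in [("ints", ints), ("lists", lists), ("mixed", mixed)]:
    print(name, series.dtype)
```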

Now that we have a basic understanding of our data, we can begin to perform some transformations. In this example, we will drop any rows that contain NULL (missing) values, drop the '#' column from our dataset, and rename the columns whose names contain whitespace. Note that dropping NULL values removes every row with at least one missing value, so Pokemon without a second type, such as Charmander, are removed.

In [13]:
df.dropna(inplace = True) #'inplace = True' modifies df directly instead of returning a new DataFrame; the result matches df = df.dropna()
df.drop('#', inplace = True, axis = 1) #axis = 1 targets columns, so this removes the '#' column from our dataset

df = df.rename(columns = {'Type 1': 'type_1', 'Type 2': 'type_2', 'Sp. Atk': 'special_attack',
                    'Sp. Def': 'special_defense'}) #takes a dictionary that maps old column names to new ones
df.head()

Unnamed: 0,Name,type_1,type_2,Total,HP,Attack,Defense,special_attack,special_defense,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
6,Charizard,Fire,Flying,534,78,84,78,109,85,100,1,False
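
The same steps can also be written without inplace, by chaining methods that each return a new DataFrame. The sketch below uses a made-up three-row subset (raw, cleaned and kept are hypothetical names) and also shows the subset argument of .dropna(), which is one way to keep rows whose only missing value is in a column we do not mind being empty.

```python
import pandas as pd

# Hypothetical miniature of the Pokemon data; Charmander has no second type.
raw = pd.DataFrame({
    "#": [1, 4, 6],
    "Name": ["Bulbasaur", "Charmander", "Charizard"],
    "Type 1": ["Grass", "Fire", "Fire"],
    "Type 2": ["Poison", None, "Flying"],
})

# Each call returns a new DataFrame, so raw itself is left unchanged.
cleaned = (
    raw.dropna()               # drop rows with any missing value
       .drop(columns="#")      # same effect as drop('#', axis = 1)
       .rename(columns={"Type 1": "type_1", "Type 2": "type_2"})
)

# Only drop rows that are missing a Name; a missing Type 2 is tolerated.
kept = raw.dropna(subset=["Name"])

print(list(cleaned["Name"]))
print(len(kept))
```

Whether to drop or keep such rows depends on the analysis; the notebook above drops them.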


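As you can see, the transformations we applied have taken effect and now we have a transformed Pandas DataFrame. We can now export this to a CSV file on our local machine using the .to_csv() method. By default .to_csv() also writes the row index as the first column; passing index = False leaves it out.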

In [14]:
df.to_csv("/home/tom/Documents/csv_files/transformed_pokemon_data.csv") #takes file path to save as
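
One detail worth checking when exporting is the row index. The sketch below writes a made-up two-row DataFrame (small is a hypothetical name) to in-memory buffers rather than to disk, once with the default settings and once with index=False, and then reads each result back to compare the columns.

```python
import io

import pandas as pd

# Hypothetical frame whose index has a gap, as ours does after dropping rows.
small = pd.DataFrame({"Name": ["Bulbasaur", "Charizard"], "HP": [45, 78]}, index=[0, 6])

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
small.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)

# index=False writes only the data columns.
without_index = io.StringIO()
small.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```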

# Summary

In this notebook we learnt how to read in a CSV or TXT file using .read_csv(), discover basic statistics and data types in our data using the .describe() method and the .dtypes attribute, and perform basic transformations on our data such as dropping rows with NULL values, renaming columns and dropping specified columns - as well as exporting this transformed data to a CSV file using .to_csv().