# 2. Advanced Calculations



**2.1. Data type conversions using pandas**

One area where you might encounter some hurdles with pandas is dealing with data types. pandas is generally pretty good at assigning proper data types but, nonetheless, you'll find many instances when you need to convert data types. 

After importing pandas, we'll read the planets data from a CSV file. Now, let's have a peek at the data. 

In [2]:
# import pandas

import pandas as pd

In [3]:
# import data
planets = pd.read_csv('planets.csv')

In [5]:
planets.head(3)

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011


From looking over the data frame, you can probably infer what the data type assignments will be, but to be sure we can access the dtypes attribute of planets. Now, we see data types varying from an object to integers to floats. 

In [6]:
planets.dtypes

method             object
number              int64
orbital_period    float64
mass              float64
distance          float64
year                int64
dtype: object

How pandas handles your data depends on the data types you've designated. For example, we'll use the mean function to return the average for all float and integer values in our dataset. Everything looks pretty good, though you might question whether it really makes sense to take an average of a year, as we've done here. 

In [13]:
planets.mean(numeric_only=True)

number               1.785507
orbital_period    2002.917596
mass                 2.638161
distance           264.069282
year              2009.070531
dtype: float64

Let's see how different data types interact. Here, we'll divide an integer column by a float. The result is a float, great, that's what you'd hope for. 

In [14]:
planets['number'][0]/planets['mass'][0]

np.float64(0.14084507042253522)

We also have the ability to change data types using the astype function. For instance, we can convert the integer value of the number column to a float. 

In [15]:
planets['number'][0].astype(float)

np.float64(1.0)

It's useful to see what happens when you convert a float to an int. In this case, we've lost everything after the decimal point. It's worth noting that this conversion truncates toward zero rather than rounding to the nearest integer, so positive floats are effectively rounded down. 

In [16]:
planets['mass'][0].astype(int)

np.int64(7)
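
To make the truncation behavior concrete, here is a minimal sketch on a small made-up series. The values are illustrative assumptions, not rows from the planets data. It suggests that astype(int) simply drops the fractional part, so a negative value moves toward zero rather than down.

```python
import pandas as pd

# Illustrative values only; not taken from planets.csv
sample = pd.Series([7.1, 2.99, -2.6])

# astype(int) drops the fractional part (truncation toward zero)
truncated = sample.astype(int)
print(truncated.tolist())  # [7, 2, -2]

# Round first if nearest-integer behavior is what you want
rounded = sample.round().astype(int)
print(rounded.tolist())  # [7, 3, -3]
```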

We could also convert the year to a string by calling astype(str). 

In [18]:
planets['year'][0].astype(str)

np.str_('2006')
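
The cells above convert one value at a time. In practice you will usually convert whole columns, so here is a small sketch of that, assuming a stand-in frame named planets_sample built from the first three rows shown earlier. Passing a dictionary to astype converts only the columns you name.

```python
import pandas as pd

# Stand-in for the planets data, using the first rows displayed above
planets_sample = pd.DataFrame({
    'number': [1, 1, 1],
    'mass': [7.1, 2.21, 2.6],
    'year': [2006, 2008, 2011],
})

# The dict maps column names to target types; unlisted columns are unchanged
converted = planets_sample.astype({'number': float, 'year': str})
print(converted.dtypes)
```

Note that astype returns a new data frame, so assign the result if you want to keep it.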

To take advantage of the datetime data type in pandas, we can convert our integer year value to a datetime using to_datetime, passing the format argument to specify how the data is currently formatted. So, you see here, our integer year has been converted to an actual datetime corresponding to the first day of the year. I want you to unlock the full potential of pandas, but before you can, there's a good chance you'll encounter some data type issues, and now you'll be ready to solve them.

In [19]:
planets['year_dt'] = pd.to_datetime(planets['year'], format='%Y')
planets['year_dt']

0      2006-01-01
1      2008-01-01
2      2011-01-01
3      2007-01-01
4      2009-01-01
          ...    
1030   2006-01-01
1031   2007-01-01
1032   2007-01-01
1033   2008-01-01
1034   2008-01-01
Name: year_dt, Length: 1035, dtype: datetime64[ns]

**2.2. Working with Strings**

Text data can be an incredibly rich source of data for analysis, and pandas is well equipped for working with, cleaning, and processing text data in string format. Let's dive into some useful methods. 

In pandas, the string accessor, denoted by .str, enables a host of useful string transformations. Let's start with a small series of names and replace the semicolon in the second name with a comma. As you can see, the replace method easily swapped the semicolon for a comma. 

In [20]:
import pandas as pd

In [43]:
names = pd.Series(['Pomeray, CODY ',' Wagner; Jarry','smith, Ray'])
names

0    Pomeray, CODY 
1     Wagner; Jarry
2        smith, Ray
dtype: object

In [44]:
names = names.str.replace(';', ',')
names

0    Pomeray, CODY 
1     Wagner, Jarry
2        smith, Ray
dtype: object

Another useful string operation is .str.len, which returns the number of characters in each string in your series. Here, the first two names have 14 characters apiece and the third name has 10 characters.

In [36]:
names.str.len()

0    14
1    14
2    10
dtype: int64

Now, I noticed some trailing and leading spaces in the names, so this is a great opportunity to use strip to remove those spaces. We'll also return the length so we can see the difference. Looks like we trimmed a leading or trailing space off of each of the first two names. 

In [45]:
names = names.str.strip()
names.str.len()

0    13
1    13
2    10
dtype: int64

For consistency, we'll also convert everything to lower case using .str.lower. Note that .str.upper acts exactly as you think it might. 

In [46]:
# lower letters

names = names.str.lower()
names

0    pomeray, cody
1    wagner, jarry
2       smith, ray
dtype: object

In [39]:
# upper letters (displayed for comparison only; not assigned, so names stays lower case)
names.str.upper()

0    POMERAY, CODY
1    WAGNER, JARRY
2       SMITH, RAY
dtype: object

Next, I want to reverse the order of first and last name. Thankfully, we have a comma delimiting last and first name, so we can use .str.split to separate the two. This creates a list of last name then first name for each of our names.

In [47]:
names = names.str.split(', ')
names

0    [pomeray, cody]
1    [wagner, jarry]
2       [smith, ray]
dtype: object

This next trick uses a list comprehension to swap the order of last name and first name for each name. Slicing a list with [::-1] returns it in reverse order. 

In [48]:
names = pd.Series([i[::-1] for i in names])
names

0    [cody, pomeray]
1    [jarry, wagner]
2       [ray, smith]
dtype: object

Our last step is to join first and last name again, separated by a space. This gives us a list of names in first-then-last order, and it's all been tidied up a bit. Now, when you encounter text data in your work, I encourage you to put these pandas string functions to work for you.

In [49]:
names = [' '.join(i) for i in names]
names

['cody pomeray', 'jarry wagner', 'ray smith']
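
The steps above can also be chained so the result stays a pandas series the whole way through. This is a sketch of one possible approach, not the method used in the cells above: it assumes every name contains exactly one ', ' delimiter after cleaning, and it uses expand=True so that split returns a data frame with one column per piece.

```python
import pandas as pd

names = pd.Series(['Pomeray, CODY ', ' Wagner; Jarry', 'smith, Ray'])

# Clean up: fix the delimiter, trim whitespace, normalize case
cleaned = (
    names.str.replace(';', ',', regex=False)
         .str.strip()
         .str.lower()
)

# expand=True returns a DataFrame: column 0 is last name, column 1 is first name
parts = cleaned.str.split(', ', expand=True)

# Concatenate the columns in first-then-last order
full_names = parts[1] + ' ' + parts[0]
print(full_names.tolist())  # ['cody pomeray', 'jarry wagner', 'ray smith']
```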

**2.3. Working with dates using pandas**

Time series data is one of the most interesting and essential types of data that we work with. But dates can be tricky to deal with. Thankfully, Pandas has some excellent methods that we can put to use. To get started, we're going to generate a series of dates. 

The period_range function in pandas allows us to do just that. By specifying the starting date, followed by the frequency and the number of periods, we get back a sequence of periods, which we'll pass into a data frame. Now we've got a data frame of four dates starting January 1st, 2020, each separated by 30 days. 

In [50]:
import pandas as pd

In [65]:
daterange = pd.period_range('1/1/2020', freq='30D', periods=4)


In [75]:
date_df = pd.DataFrame(data=daterange, columns=['sample date'])
date_df

Unnamed: 0,sample date
0,2020-01-01
1,2020-01-31
2,2020-03-01
3,2020-03-31
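
If you would rather start from timestamps than periods, pd.date_range is the timestamp counterpart of period_range. The sketch below is an alternative to the cell above, not a replacement for it; the names stamps and stamp_df are hypothetical.

```python
import pandas as pd

# Four timestamps, 30 days apart, starting January 1st, 2020
stamps = pd.date_range('2020-01-01', periods=4, freq='30D')

stamp_df = pd.DataFrame({'sample date': stamps})
print(stamp_df)
print(stamp_df.dtypes)
```

The resulting column is a datetime64 column from the start, so no later conversion with to_timestamp is needed.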


Date difference from prior date using `diff`

One useful method when working with time series data is diff, which calculates the difference from a prior period and in this sense operates much like a SQL LAG function. Let's see the difference from the prior date in our data. Sure enough, our dates are 30 days apart. 

In [76]:
# calculate difference between dates

date_df['date difference'] = date_df['sample date'].diff(periods = 1)
date_df

Unnamed: 0,sample date,date difference
0,2020-01-01,NaT
1,2020-01-31,<30 * Days>
2,2020-03-01,<30 * Days>
3,2020-03-31,<30 * Days>
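
To illustrate the comparison with a SQL lag, here is a small sketch on a timestamp series (an assumption; the cell above uses periods). It suggests that diff gives the same result as subtracting a shifted copy of the series, where shift plays the role of the lag.

```python
import pandas as pd

dates = pd.Series(pd.date_range('2020-01-01', periods=4, freq='30D'))

# diff: each value minus the value one row earlier
gaps = dates.diff(periods=1)

# The same calculation spelled out with shift, which acts like a SQL lag
lagged = dates - dates.shift(1)

print(gaps)
print(lagged)
```

The first row has no prior value, so both versions return NaT there.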


`Find the first day of the month`

Often you'll want to take a date and convert it to the first day of its month, similar to using a DATE_TRUNC function in SQL. One easy method is to access the values property of our date column, then use astype with datetime64 to convert to a datetime. By passing M between the square brackets, each date is truncated to the first of the month. 

In [77]:
# Find the first day of the month

date_df['first of month'] = date_df['sample date'].values.astype('datetime64[M]')
date_df

Unnamed: 0,sample date,date difference,first of month
0,2020-01-01,NaT,2020-01-01
1,2020-01-31,<30 * Days>,2020-01-01
2,2020-03-01,<30 * Days>,2020-03-01
3,2020-03-31,<30 * Days>,2020-03-01
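
Another way to get the first of the month, sketched here as an alternative under the assumption that your column already holds timestamps, is to convert to a monthly period and back. The name sample_dates is hypothetical.

```python
import pandas as pd

sample_dates = pd.Series(
    pd.to_datetime(['2020-01-01', '2020-01-31', '2020-03-01', '2020-03-31'])
)

# to_period('M') keeps only year and month; to_timestamp() returns the period start
first_of_month = sample_dates.dt.to_period('M').dt.to_timestamp()

print(first_of_month.dt.strftime('%Y-%m-%d').tolist())
# ['2020-01-01', '2020-01-01', '2020-03-01', '2020-03-01']
```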


Now let's quickly check our data types. You'll notice our original date is actually a period data type. We'll go ahead and convert that to a datetime64 timestamp using the dt accessor and the to_timestamp function. This will help with some further transformations we'll want to use. 


In [78]:
# date types

date_df.dtypes

sample date          period[30D]
date difference           object
first of month     datetime64[s]
dtype: object

In [79]:
date_df['sample date'] = date_df['sample date'].dt.to_timestamp()
date_df.dtypes

sample date        datetime64[ns]
date difference            object
first of month      datetime64[s]
dtype: object

`Date Subtraction`

To subtract two dates, no special treatment is involved: just subtract them. Here we'll see the number of days between our date and the first of the month. In a similar fashion, we can subtract our date difference from above and get the expected outcome. You can also use Timedelta to specify a time span you want to add to or subtract from your date. 

In [80]:
# date subtraction
date_df['sample date'] - date_df['first of month']

0    0 days
1   30 days
2    0 days
3   30 days
dtype: timedelta64[ns]

In [81]:
date_df['sample date'] - date_df['date difference']

0                    NaT
1    2020-01-01 00:00:00
2    2020-01-31 00:00:00
3    2020-03-01 00:00:00
dtype: object

In [82]:
date_df['sample date'] - pd.Timedelta('30 d')

0   2019-12-02
1   2020-01-01
2   2020-01-31
3   2020-03-01
Name: sample date, dtype: datetime64[ns]
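
A Timedelta is a fixed span of time. When you want calendar-aware arithmetic instead, such as "one month later", pandas also offers pd.DateOffset. The sketch below contrasts the two on a single hypothetical date; DateOffset is not used elsewhere in this section.

```python
import pandas as pd

start = pd.Timestamp('2020-01-31')

# Fixed span: exactly 30 days later
plus_30_days = start + pd.Timedelta(days=30)

# Calendar-aware: one month later, clipped to the last valid day of February
plus_1_month = start + pd.DateOffset(months=1)

print(plus_30_days.date())  # 2020-03-01
print(plus_1_month.date())  # 2020-02-29
```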

Lastly, when working with date data types, Pandas has several quick tricks accessible with the dt accessor. Here we'll return the actual day name corresponding to each of our dates with the dt.day_name function. Very cool. Well, that's it for dates. Definitely check out the many other useful dt functions which are detailed in Pandas' thorough documentation online.

In [83]:
date_df['sample date'].dt.day_name()

0    Wednesday
1       Friday
2       Sunday
3      Tuesday
Name: sample date, dtype: object

**2.4. Working with missing data**

When you first dive into your dataset, you may be surprised to find that some data simply isn't there at all. How you proceed to treat your data will have important ramifications down the line in your analysis. Let's review some approaches to dealing with missing data in Pandas. 

First, we'll create a data frame with temperature measurements. Here it is. Note the two missing values in sequence number four.



In [84]:
import pandas as pd

In [85]:
temps = pd.DataFrame({ "sequence":[1,2,3,4,5],
                      "measurement_type":['actual', 'actual', 'actual', None, 'estimated'],
                      "temperature_f": [67.24, 84.56, 91.61, None, 49.64]
})
temps

Unnamed: 0,sequence,measurement_type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
3,4,,
4,5,estimated,49.64


Using `isna()` to identify null values in a dataframe

One method to quickly identify all missing values in your data frame is to call isna. This will return True for any cells containing a missing value. Generally, the default parameters in pandas functions are built to handle null values. For example, sums treat nulls as zero and means ignore null values by default. 

In [86]:
# isna() function

temps.isna()

Unnamed: 0,sequence,measurement_type,temperature_f
0,False,False,False
1,False,False,False
2,False,False,False
3,False,True,True
4,False,False,False
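
On a larger data frame, a full grid of True and False values is hard to scan. A common follow-up, sketched here with the same temps data, is to count the nulls per column or to pull out just the rows that contain one.

```python
import pandas as pd

temps = pd.DataFrame({
    "sequence": [1, 2, 3, 4, 5],
    "measurement_type": ['actual', 'actual', 'actual', None, 'estimated'],
    "temperature_f": [67.24, 84.56, 91.61, None, 49.64],
})

# True counts as 1 when summed, so this gives the number of nulls per column
null_counts = temps.isna().sum()
print(null_counts)

# Boolean mask of rows with at least one null, used to filter the frame
rows_with_nulls = temps[temps.isna().any(axis=1)]
print(rows_with_nulls)
```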


Let's see an example using a cumulative sum down our data frame. By default, the cumulative sum skips nulls. If we set skipna equal to False, every result after the first null will also be null. One case where you'll need to be mindful of how pandas treats nulls is when aggregating your data using groupby. The default behavior is to exclude any records with null values in any of the dimensions you're grouping by. 

In [87]:
temps['temperature_f'].cumsum()

0     67.24
1    151.80
2    243.41
3       NaN
4    293.05
Name: temperature_f, dtype: float64
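
The cell above shows only the default behavior. Here is a short sketch of the skipna=False case described in the text, using the same temperature values in a standalone series.

```python
import pandas as pd

temperature_f = pd.Series([67.24, 84.56, 91.61, None, 49.64])

# Default: the null row itself is NaN, but the running total resumes afterwards
default_cumsum = temperature_f.cumsum()

# skipna=False: once a null is hit, every later result is NaN as well
strict_cumsum = temperature_f.cumsum(skipna=False)

print(default_cumsum)
print(strict_cumsum)
```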

Let's see an example. Notice our entry with no measurement was not included. To prevent the group by from dropping nulls, pass dropna equal to false. 

In [89]:
# by default, groupby drops rows whose grouping key is null
temps.groupby(by=['measurement_type']).max()

Unnamed: 0_level_0,sequence,temperature_f
measurement_type,Unnamed: 1_level_1,Unnamed: 2_level_1
actual,3,91.61
estimated,5,49.64


In [90]:
# can specify to retain NA dimensions in grouping
temps.groupby(by=['measurement_type'], dropna=False).max()

Unnamed: 0_level_0,sequence,temperature_f
measurement_type,Unnamed: 1_level_1,Unnamed: 2_level_1
actual,3,91.61
estimated,5,49.64
,4,


Dealing with missing data: The blunt approach using `dropna()`

Great, now let's review some methods to treat these nulls before you get too far along in your analysis. The most straightforward method is to simply drop records with null using dropna. Note, this method isn't without repercussions and you should consider this carefully before using this approach. By calling dropna, the default behavior is to drop any rows which contain null values in any column. Here, you can see that sequence four was dropped. 

In [91]:
#drop rows with null using axis=0 (default)
temps.dropna()

Unnamed: 0,sequence,measurement_type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
4,5,estimated,49.64


Now, if you only want to drop rows with nulls in certain columns, you can use the subset parameter. A less common approach is to drop any columns that contain null values, which you can do by passing axis equal to one in dropna. Now you can see we're left with just the one column without any nulls. 

In [92]:
#drop columns with null using axis=1
temps.dropna(axis=1)

Unnamed: 0,sequence
0,1
1,2
2,3
3,4
4,5
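
The subset parameter mentioned above is not demonstrated in the cells, so here is a minimal sketch. It uses a small hypothetical frame named readings in which one row is missing only its measurement type, so the difference from a plain dropna is visible.

```python
import pandas as pd

# Hypothetical data: row 2 lacks only a type, row 3 lacks type and temperature
readings = pd.DataFrame({
    "sequence": [1, 2, 3],
    "measurement_type": ['actual', None, None],
    "temperature_f": [67.24, 84.56, None],
})

# Plain dropna removes any row with a null in any column
all_complete = readings.dropna()

# subset restricts the null check to the listed columns
has_temperature = readings.dropna(subset=['temperature_f'])

print(all_complete['sequence'].tolist())     # [1]
print(has_temperature['sequence'].tolist())  # [1, 2]
```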


Replace null values using ``fillna()``

Another method is to actually fill null values using fillna. To see this in action, we'll fill our nulls with zeros. At first glance, this could be problematic. Imagine if we were to calculate the mean for our temperature column. It would be heavily biased by the zero we just introduced. 

In [93]:
temps.fillna(0)

Unnamed: 0,sequence,measurement_type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
3,4,0,0.0
4,5,estimated,49.64
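
To put a number on that bias, the sketch below compares the mean of the temperature values with and without the zero fill, using the same five readings in a standalone series.

```python
import pandas as pd

temperature_f = pd.Series([67.24, 84.56, 91.61, None, 49.64])

# mean() ignores nulls by default, so this averages the four real readings
mean_ignoring_nulls = temperature_f.mean()

# After fillna(0), the zero counts as a fifth reading and drags the mean down
mean_with_zero_fill = temperature_f.fillna(0).mean()

print(round(mean_ignoring_nulls, 4))  # 73.2625
print(round(mean_with_zero_fill, 4))  # 58.61
```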


Another more nuanced approach is to use the pad method, also known as forward fill. This will carry over values from a prior row. This method poses its own issues, largely because we've simply created data out of thin air. Given the drop from 91 degrees to 50 degrees that we see, we might expect sequence four to fall somewhere in the middle.

In [94]:
#pad method: ffill() is the current spelling of the deprecated fillna(method='pad')

temps.ffill()

Unnamed: 0,sequence,measurement_type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
3,4,actual,91.61
4,5,estimated,49.64


`Interpolate`

This brings us to our final method called interpolate. While interpolate allows for several different methods, the default approach will create a straight line estimate for our missing temperature value. There you go, now the estimate lies halfway between the two values. So before you get too far along analyzing your data, be sure to check for null values and put these methods to use.

In [95]:
#interpolate

temps.interpolate()

Unnamed: 0,sequence,measurement_type,temperature_f
0,1,actual,67.24
1,2,actual,84.56
2,3,actual,91.61
3,4,,70.625
4,5,estimated,49.64
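
Since interpolation only makes sense for numeric data, one option is to interpolate just the numeric column rather than the whole data frame. This is a sketch of that variation, assuming the same temps data; the name temps_filled is hypothetical.

```python
import pandas as pd

temps = pd.DataFrame({
    "sequence": [1, 2, 3, 4, 5],
    "measurement_type": ['actual', 'actual', 'actual', None, 'estimated'],
    "temperature_f": [67.24, 84.56, 91.61, None, 49.64],
})

# Linear interpolation (the default) on the numeric column only;
# assign returns a new frame and leaves temps untouched
temps_filled = temps.assign(temperature_f=temps['temperature_f'].interpolate())

print(temps_filled)
```

The missing temperature becomes the midpoint of its neighbors, while the missing measurement type is left as null for you to handle separately.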


**2.5. Apply/Map/Applymap**

Python functions can be applied with great impact in pandas to alter data in your data frames. Thankfully, you don't have to create a for loop to iterate through every row in your data to do this. In fact, it's not encouraged if you can help it. Pandas has frameworks that are simpler and more performant, known as Apply, Map, and Applymap. 

In [2]:
import pandas as pd

In [3]:
df = pd.DataFrame({"Region":['North','West','East','South','North','West','East','South'],
          "Team":['One','One','One','One','Two','Two','Two','Two'],
          "Squad":['A','B','C','D','E','F','G','H'],
          "Revenue":[7500,5500,2750,6400,2300,3750,1900,575],
            "Cost":[5200,5100,4400,5300,1250,1300,2100,50]})

Let's dig in. We'll start off with the data frame, including revenue and cost data for certain regions, teams, and squads. Now let's say you want to determine whether each squad was returning a profit or not, meaning its revenue exceeds its cost. This is an excellent application for Apply. 

``Apply`` allows you to harness functions to alter values along an axis in your data frame, or in your series. 

We'll save a bit more time by using a Lambda function. 

A `Lambda function` allows you to create the function in the Apply statement without having to create it in advance. 

Our Lambda function will return the string profit if revenue is greater than cost, otherwise it will return loss. The application of this Lambda function will return a series which we will use to populate the profit column in our data frame. 

In [4]:
# use apply() to alter values along an axis in your dataframe or in a series by applying a function.

df['Profit'] = df.apply(lambda x: 'Profit' if x['Revenue']> x['Cost'] else 'Loss', axis=1)
df

Unnamed: 0,Region,Team,Squad,Revenue,Cost,Profit
0,North,One,A,7500,5200,Profit
1,West,One,B,5500,5100,Profit
2,East,One,C,2750,4400,Loss
3,South,One,D,6400,5300,Profit
4,North,Two,E,2300,1250,Profit
5,West,Two,F,3750,1300,Profit
6,East,Two,G,1900,2100,Loss
7,South,Two,H,575,50,Profit
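
For a simple two-way condition like this one, a vectorized comparison is a common alternative to a row-wise Apply. The sketch below uses NumPy's where function on a trimmed copy of the data; it is offered as an option, not as the approach this section teaches.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Squad": ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    "Revenue": [7500, 5500, 2750, 6400, 2300, 3750, 1900, 575],
    "Cost": [5200, 5100, 4400, 5300, 1250, 1300, 2100, 50],
})

# Compare whole columns at once; pick 'Profit' where True, 'Loss' where False
df['Profit'] = np.where(df['Revenue'] > df['Cost'], 'Profit', 'Loss')

print(df)
```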


After running, we see it worked as intended: where revenue exceeds cost we show Profit, and otherwise Loss. Our next approach is called Map. It works on a series only, and you can use it to alter values using a function, a dictionary, or another series. Here, we'll create a dictionary that maps teams to their corresponding color. 

In [6]:
# Use map to substitute each value in a series, using either a function, dictionary, or series.

team_map = {"One":"Red", "Two":"Blue"}

In [7]:
df['Team Color'] = df['Team'].map(team_map)
df

Unnamed: 0,Region,Team,Squad,Revenue,Cost,Profit,Team Color
0,North,One,A,7500,5200,Profit,Red
1,West,One,B,5500,5100,Profit,Red
2,East,One,C,2750,4400,Loss,Red
3,South,One,D,6400,5300,Profit,Red
4,North,Two,E,2300,1250,Profit,Blue
5,West,Two,F,3750,1300,Profit,Blue
6,East,Two,G,1900,2100,Loss,Blue
7,South,Two,H,575,50,Profit,Blue
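
Two details of Map are worth seeing in a small sketch: what happens to a value that has no entry in the dictionary, and how Map behaves when given a function instead. The series below is hypothetical and includes a team, 'Three', that the dictionary does not cover.

```python
import pandas as pd

teams = pd.Series(['One', 'Two', 'Three'])

# Dictionary lookup: values without a matching key become NaN
colors = teams.map({'One': 'Red', 'Two': 'Blue'})
print(colors)

# Function: called once per value in the series
shouted = teams.map(str.upper)
print(shouted.tolist())  # ['ONE', 'TWO', 'THREE']
```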


Now we can map the values in the team column to a new column, which we'll call team color. Looks great. Another way to alter data in your data frame is to use Applymap, which applies a function to each element in your data frame. (From pandas 2.1 onward, applymap is deprecated in favor of the equivalent DataFrame.map.) To show an example, we'll create a simple Lambda function which returns the character length of each value in our data frame.


In [8]:
# Use applymap() to apply a function to each element in your dataframe

df.applymap(lambda x: len(str(x)))

Unnamed: 0,Region,Team,Squad,Revenue,Cost,Profit,Team Color
0,5,3,1,4,4,6,3
1,4,3,1,4,4,6,3
2,4,3,1,4,4,4,3
3,5,3,1,4,4,6,3
4,5,3,1,4,4,6,4
5,4,3,1,4,4,6,4
6,4,3,1,4,4,4,4
7,5,3,1,3,2,6,4
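
Because pandas 2.1 renamed the element-wise method from applymap to DataFrame.map, code that must run on both older and newer versions can pick whichever one exists. The following is a sketch under that assumption; the frame small and the name elementwise are hypothetical.

```python
import pandas as pd

small = pd.DataFrame({"Region": ['North', 'West'], "Revenue": [7500, 575]})

# Prefer DataFrame.map where it exists (pandas 2.1+), else fall back to applymap
elementwise = small.map if hasattr(small, 'map') else small.applymap

lengths = elementwise(lambda x: len(str(x)))
print(lengths)
```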


Now of course, there may be times where it's simply conceptually easier to formulate your logic as a for loop rather than one of the above. And that's okay. Let's see an example in action. Below, we're going to calculate each squad's revenue as a percent of the region's overall revenue. We'll start with an empty list that we populate as we iterate through each row in our data frame. To construct the for loop, we loop through each i in the range from zero up to the length of our data frame. Rev represents the revenue value for that particular row divided by the sum of all revenue in our data frame where the region equals this particular squad's region. Lastly, we append this value before continuing the loop.

In [9]:
# if all else fails, use a for loop
new_col = []

for i in range(0, len(df)):
    rev = df['Revenue'][i] / df[df['Region'] == df.loc[i, 'Region']]['Revenue'].sum()
    new_col.append(rev)

Next, we set a new column, Revenue Share of Region, equal to this list. Let's check out the result. The output looks great, and the revenue share within each region sums up to a hundred percent. 

In [10]:
df['Revenue Share of Region'] = new_col
df.sort_values(by='Region')


Unnamed: 0,Region,Team,Squad,Revenue,Cost,Profit,Team Color,Revenue Share of Region
2,East,One,C,2750,4400,Loss,Red,0.591398
6,East,Two,G,1900,2100,Loss,Blue,0.408602
0,North,One,A,7500,5200,Profit,Red,0.765306
4,North,Two,E,2300,1250,Profit,Blue,0.234694
3,South,One,D,6400,5300,Profit,Red,0.917563
7,South,Two,H,575,50,Profit,Blue,0.082437
1,West,One,B,5500,5100,Profit,Red,0.594595
5,West,Two,F,3750,1300,Profit,Blue,0.405405
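
For comparison, the same revenue share can be computed without a loop by combining groupby with transform, which returns one value per original row. This is a sketch of an alternative on a trimmed copy of the data, not the method used in the cells above.

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ['North', 'West', 'East', 'South', 'North', 'West', 'East', 'South'],
    "Squad": ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    "Revenue": [7500, 5500, 2750, 6400, 2300, 3750, 1900, 575],
})

# transform('sum') broadcasts each region's total back onto its rows
region_total = df.groupby('Region')['Revenue'].transform('sum')

df['Revenue Share of Region'] = df['Revenue'] / region_total
print(df.sort_values(by='Region'))
```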


The methods we just implemented are among the most powerful in pandas. Between them, there's enough flexibility for just about any application you'll come across. Remember, Apply can be used for both data frames and series. Map works for series only. Applymap affects each element in a data frame. And when in doubt, a for loop should do the trick.