# 4. Method Duplication

There are multiple methods in Pandas that do the exact same thing. Whenever two methods share the same underlying functionality, we say that they are **aliases** of each other. Duplication in a library is unnecessary: it pollutes the namespace and forces analysts to remember one more piece of information about the library.

This notebook covers several instances of duplication along with other instances of methods that are very similar to one another.

### `read_csv` vs `read_table` duplication
One example of duplication is the pair of functions `read_csv` and `read_table`. Both do exactly the same thing: read data from a text file. The only difference is that `read_csv` defaults the delimiter to a comma, while `read_table` defaults it to a tab.

Let's verify that `read_csv` and `read_table` are the same. The `equals` method verifies whether two DataFrames have the exact same values.

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 100)

college = pd.read_csv('data/college.csv')
college2 = pd.read_table('data/college.csv', delimiter=',')
college.equals(college2)

True
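The same check can be run in the other direction without a file on disk. The sketch below is only illustrative: the tab-separated text and its column names are made up, and `io.StringIO` stands in for a real file.

```python
import io
import pandas as pd

# Hypothetical tab-separated data held in memory instead of a file on disk
tsv_text = "name\tscore\nA\t1\nB\t2\n"

# read_table splits on tabs by default; read_csv needs sep='\t' to match it
from_table = pd.read_table(io.StringIO(tsv_text))
from_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(from_table.equals(from_csv))
```

Once the delimiter is set explicitly, either function can do the other's job.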

In [2]:
college.head()

Unnamed: 0,instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


### `read_table` is getting deprecated
I made a post in [the Pandas GitHub repo][1] listing a few functions and methods that I'd like to see deprecated. The `read_table` function was deprecated in Pandas 0.24 (the deprecation was reversed in 0.25), and I still recommend never using it.

**GUIDANCE** - Use only `read_csv`

[1]: https://github.com/pandas-dev/pandas/issues/18262#issuecomment-346502617

### `isna` vs `isnull` and `notna` vs `notnull`
The `isna` and `isnull` methods both determine whether each value in the DataFrame is missing or not. The result will always be a DataFrame (or Series) of all boolean values.

These methods are exactly the same. We say that one is an **alias** for the other. There is no need for both of them in the library. The `isna` method was added more recently because the characters `na` are found in other missing value methods such as `dropna` and `fillna`.

`notna` and `notnull` are aliases of each other as well and simply return the opposite of `isna`. There's no need for both of them.

Let's verify that `isna` and `isnull` are aliases.

In [None]:
college_isna = college.isna()
college_isnull = college.isnull()
college_isna.equals(college_isnull)

### I only use `isna` and `notna`

I use the methods that end in `na` to match the names of the other missing value methods `dropna` and `fillna`. 

You can also avoid ever using `notna`, since Pandas provides the inversion operator, `~`, to invert boolean DataFrames.
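As a small sketch of that inversion, using a made-up Series with one missing value, inverting `isna` gives the same result as `notna`:

```python
import numpy as np
import pandas as pd

# Small made-up Series with one missing value
s = pd.Series([1.0, np.nan, 3.0])

not_missing_method = s.notna()     # dedicated method
not_missing_inverted = ~s.isna()   # invert the boolean result of isna

print(not_missing_method.equals(not_missing_inverted))
```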

**GUIDANCE** - Use only `isna` and `notna`

## Arithmetic and Comparison Operators

All arithmetic operators have corresponding methods that function similarly.

* `+` - `add`
* `-` - `sub` and `subtract`
* `*` - `mul` and `multiply`
* `/` - `div`, `divide` and `truediv`
* `**` - `pow`
* `//` - `floordiv`
* `%` - `mod`

All the comparison operators also have corresponding methods.

* `>` - `gt`
* `<` - `lt`
* `>=` - `ge`
* `<=` - `le`
* `==` - `eq`
* `!=` - `ne`

Let's select the undergraduate population (`ugds`) column as a Series, add 100 to it, and verify that the plus operator and its corresponding method, `add`, give the same result.

In [None]:
ugds = college['ugds']
ugds_operator = ugds + 100
ugds_method = ugds.add(100)
ugds_operator.equals(ugds_method)
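The same comparison can be looped over several operator and method pairs at once. This is only a sketch on a made-up Series; it uses the standard library `operator` module to apply each operator as a function.

```python
import operator
import pandas as pd

s = pd.Series([7.0, 2.0, 9.0])  # made-up values

# Pair each operator (as a function) with the name of its Series method
pairs = [
    (operator.add, "add"),
    (operator.sub, "sub"),
    (operator.mul, "mul"),
    (operator.truediv, "div"),
    (operator.pow, "pow"),
    (operator.floordiv, "floordiv"),
    (operator.mod, "mod"),
    (operator.gt, "gt"),
    (operator.le, "le"),
    (operator.eq, "eq"),
]

# For every pair, compare `s <op> 2` with `s.<method>(2)`
results = {name: op(s, 2).equals(getattr(s, name)(2)) for op, name in pairs}
print(results)
```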

### Calculating the z-scores of each school
Let's do a slightly more complex example. Below, we set the index to be the institution name and then select both of the SAT columns.

In [3]:
college_idx = college.set_index('instnm')
sats = college_idx[['satmtmid', 'satvrmid']].dropna()
sats.head()

instnm,satmtmid,satvrmid
Alabama A & M University,420.0,424.0
University of Alabama at Birmingham,565.0,570.0
University of Alabama in Huntsville,590.0,595.0
Alabama State University,430.0,425.0
The University of Alabama,565.0,555.0


Let's say we are interested in finding the z-score for each college's SAT score. To calculate this, we would need to subtract the mean and divide by the standard deviation. Let's do that first with operators.

In [4]:
mean = sats.mean()
mean

satmtmid    530.958615
satvrmid    522.775338
dtype: float64

In [5]:
std = sats.std()
std

satmtmid    73.645153
satvrmid    68.591051
dtype: float64

In [6]:
zscore_operator = (sats - mean) / std
zscore_operator.head()

instnm,satmtmid,satvrmid
Alabama A & M University,-1.506666,-1.440062
University of Alabama at Birmingham,0.462235,0.688496
University of Alabama in Huntsville,0.801701,1.052975
Alabama State University,-1.370879,-1.425482
The University of Alabama,0.462235,0.469809


Let's repeat this with the methods and verify equality.

In [None]:
zscore_methods = sats.sub(mean).div(std)
zscore_operator.equals(zscore_methods)
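One way to sanity-check a z-score calculation is to confirm that every standardized column ends up with mean 0 and standard deviation 1. The sketch below uses a small made-up DataFrame (the column names `math` and `verbal` are placeholders, not columns from the college dataset).

```python
import numpy as np
import pandas as pd

# Made-up scores standing in for the SAT columns
scores = pd.DataFrame({"math": [420.0, 565.0, 590.0, 430.0],
                       "verbal": [424.0, 570.0, 595.0, 425.0]})

# Subtract each column's mean, then divide by each column's standard deviation
z = scores.sub(scores.mean()).div(scores.std())

# After standardizing, every column has mean 0 and standard deviation 1
print(np.allclose(z.mean(), 0), np.allclose(z.std(), 1))
```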

### An actual need for the method
So far we haven't seen an explicit need for the methods over the operators. Let's see an example where we absolutely need the method to complete the task. The college dataset contains 9 consecutive columns holding the fraction of the undergraduate population in each race category. The first column is `ugds_white` and the last is `ugds_unkn`. Let's select these columns into their own DataFrame.

In [7]:
college_race = college_idx.loc[:, 'ugds_white':'ugds_unkn']
college_race.head()

instnm,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


Let's say we are interested in the raw count of each race per school. We would need to multiply the total undergraduate population by each column. Below we select the `ugds` column as a Series.

In [8]:
ugds = college_idx['ugds']
ugds.head()

instnm
Alabama A & M University                4206.0
University of Alabama at Birmingham    11383.0
Amridge University                       291.0
University of Alabama in Huntsville     5451.0
Alabama State University                4811.0
Name: ugds, dtype: float64

We then multiply the `college_race` DataFrame by this Series. Intuitively, this seems like it should work, but it doesn't. Instead, it returns an enormous DataFrame with 7,544 columns.

In [9]:
df_attempt = college_race * ugds
df_attempt.head()

instnm,A & W Healthcare Educators,A T Still University of Health Sciences,ABC Beauty Academy,...,ugds_2mor,ugds_aian,ugds_asian,ugds_black,ugds_hisp,ugds_nhpi,ugds_nra,ugds_unkn,ugds_white
Alabama A & M University,NaN,NaN,NaN,...,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN
University of Alabama at Birmingham,NaN,NaN,NaN,...,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN
Amridge University,NaN,NaN,NaN,...,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN
University of Alabama in Huntsville,NaN,NaN,NaN,...,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN
Alabama State University,NaN,NaN,NaN,...,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN


In [10]:
df_attempt.shape

(7535, 7544)

### Automatic alignment on the index and/or columns
Whenever an operation takes place between two Pandas objects, their indexes and/or columns are always aligned first. In the above operation, we multiplied the `college_race` **DataFrame** and the `ugds` **Series** together. Pandas automatically (implicitly) aligned the **columns** of `college_race` to the **index** values of `ugds`.

None of the `college_race` columns match the index values of `ugds`. Pandas does the alignment by performing an **outer join**, keeping all labels that match as well as those that do not. This returns a ridiculous-looking DataFrame in which every value is missing. The original column names of the `college_race` DataFrame appear at the far right of the result.
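The same behavior can be reproduced on a tiny scale. The sketch below uses made-up school names, group columns, and totals purely for illustration.

```python
import pandas as pd

# Made-up fractions and totals for two hypothetical schools
fractions = pd.DataFrame({"grp_a": [0.25, 0.5], "grp_b": [0.75, 0.5]},
                         index=["School X", "School Y"])
totals = pd.Series([1000, 200], index=["School X", "School Y"])

# The operator matches the DataFrame's COLUMNS against the Series' INDEX
misaligned = fractions * totals

print(misaligned.columns.tolist())   # union of column names and school names
print(misaligned.isna().all().all()) # no label matched, so nothing was multiplied
```

Because no column name equals a school name, the outer join doubles the number of columns and leaves every cell missing.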

### Change the direction of the alignment with a method
Operators work in only one way. We cannot change how the multiplication operator, `*`, works. Methods, on the other hand, can have parameters that control how the operation takes place.

### Use the `axis` parameter of the `mul` method
All the methods that correspond to the operators listed above have an `axis` parameter that allows us to change the direction of the alignment. So, instead of aligning the columns of a DataFrame to the index of a Series, we can align the index of a DataFrame to the index of a Series. Let's do that now so that we can find the answer to our problem from above.

In [11]:
df_correct = college_race.mul(ugds, axis='index').round(0)
df_correct.head()

instnm,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn
Alabama A & M University,140.0,3934.0,23.0,8.0,10.0,8.0,0.0,25.0,58.0
University of Alabama at Birmingham,6741.0,2960.0,322.0,590.0,25.0,8.0,419.0,204.0,114.0
Amridge University,87.0,122.0,2.0,1.0,0.0,0.0,0.0,0.0,79.0
University of Alabama in Huntsville,3809.0,684.0,208.0,205.0,78.0,1.0,94.0,181.0,191.0
Alabama State University,76.0,4430.0,58.0,9.0,5.0,3.0,47.0,117.0,66.0


### Discussion
By default, the `axis` parameter is set to 'columns', which aligns the Series index to the DataFrame's columns. We changed it to 'index' so that the Series index aligned to the DataFrame's index instead.
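A minimal sketch of the `axis` switch on made-up data (the school names and group columns are placeholders) looks like this:

```python
import pandas as pd

# Made-up fractions and totals for two hypothetical schools
fractions = pd.DataFrame({"grp_a": [0.25, 0.5], "grp_b": [0.75, 0.5]},
                         index=["School X", "School Y"])
totals = pd.Series([1000, 200], index=["School X", "School Y"])

# axis='index' aligns the Series index with the DataFrame's row labels,
# so each row is multiplied by that school's total
counts = fractions.mul(totals, axis="index")
print(counts)
```

Each row of `fractions` is scaled by the matching entry of `totals`, and the column labels are left untouched.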

### Another use-case for the method - filling in missing values
There exists another use-case when the operator cannot be used. Take a look at the following two Series. They are each a different length and have some index values in common.

In [None]:
s1 = pd.Series(index=['a', 'b', 'c', 'e'], data=[4, 8, 3, 5])
s2 = pd.Series(index=['a', 'b', 'd'], data=[2, 1, 9])

In [None]:
s1

In [None]:
s2

It is still possible to do an arithmetic or comparison operation between them. Let's add the two Series together. 

In [None]:
s1 + s2

### Explanation
Pandas automatically aligns the indexes and then adds the values of the Series. Only the index labels 'a' and 'b' appear in both Series. The other labels don't align, but they are still kept in the result with missing values. Again, Pandas does an outer join when aligning.
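To make the outer join visible, the sketch below repeats the addition and inspects which labels end up in the result and which of them are missing.

```python
import pandas as pd

s1 = pd.Series(index=['a', 'b', 'c', 'e'], data=[4, 8, 3, 5])
s2 = pd.Series(index=['a', 'b', 'd'], data=[2, 1, 9])

total = s1 + s2

# The result's index is the union of both indexes (the outer join)
print(total.index.tolist())

# Labels present in only one Series come back as missing values
print(total[total.isna()].index.tolist())
```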

### Use the corresponding method to fill in the missing values
The `add` method performs the same addition but allows us to control what happens to those values that don't align. The `fill_value` parameter sets the value to use in place of a missing partner when a label exists in only one Series. Below, we set it to 100, so 100 is added to each value that is not aligned.

In [None]:
s1.add(s2, fill_value=100)
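As a quick check of what `fill_value` does, the sketch below rebuilds the two Series and prints the result. Labels 'c', 'd', and 'e' each exist in only one Series, so their partner is treated as 100.

```python
import pandas as pd

s1 = pd.Series(index=['a', 'b', 'c', 'e'], data=[4, 8, 3, 5])
s2 = pd.Series(index=['a', 'b', 'd'], data=[2, 1, 9])

# A label missing from one Series is treated as 100 before the addition
filled = s1.add(s2, fill_value=100)
print(filled)
```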

## Built-in functions vs Pandas methods with the same name
There are a few DataFrame/Series methods that will return the same result if a built-in Python function with the same name is used. They are:
* `sum`
* `min`
* `max`
* `abs`

Let's verify that they give the same result by testing them out on a single column of data.

In [13]:
ugds = college['ugds'].dropna()
ugds.head()

0     4206.0
1    11383.0
2      291.0
3     5451.0
4     4811.0
Name: ugds, dtype: float64

In [14]:
sum(ugds)

16200904.0

In [15]:
ugds.sum()

16200904.0

In [16]:
max(ugds)

151558.0

In [17]:
ugds.max()

151558.0

In [18]:
min(ugds)

0.0

In [19]:
ugds.min()

0.0

In [20]:
abs(ugds).head()

0     4206.0
1    11383.0
2      291.0
3     5451.0
4     4811.0
Name: ugds, dtype: float64

In [21]:
ugds.abs().head()

0     4206.0
1    11383.0
2      291.0
3     5451.0
4     4811.0
Name: ugds, dtype: float64
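The results above match because the missing values were dropped first. The sketch below, on a made-up Series, shows how the two versions of `sum` differ when a missing value is present: the built-in function propagates `nan`, while the Pandas method skips missing values by default.

```python
import math
import numpy as np
import pandas as pd

# Made-up Series containing one missing value
s = pd.Series([1.0, np.nan, 3.0])

builtin_total = sum(s)   # plain Python addition, so nan propagates
method_total = s.sum()   # skipna=True by default, so nan is ignored

print(builtin_total, method_total)
```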

### Time the performance of each

**`sum`**

In [22]:
%timeit -n 5 sum(ugds)

644 µs ± 80.3 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)


In [23]:
%timeit -n 5 ugds.sum()

164 µs ± 81 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)


**`min`**

In [24]:
%timeit -n 5 min(ugds)

705 µs ± 33.6 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)


In [25]:
%timeit -n 5 ugds.min()

151 µs ± 64 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)


**`max`**

In [26]:
%timeit -n 5 max(ugds)

717 µs ± 46.5 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)


In [27]:
%timeit -n 5 ugds.max()

172 µs ± 81.9 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)


**`abs`**

In [28]:
%timeit -n 5 abs(ugds)

138 µs ± 32.6 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)


In [29]:
%timeit -n 5 ugds.abs()

128 µs ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)


## Performance discrepancy for `sum`, `max`, and `min`
There are clear performance discrepancies for `sum`, `max`, and `min`. Completely different code is executed when the built-in Python functions are used than when the Pandas methods are called. Calling `sum(ugds)` essentially runs a Python for loop that iterates through each value one at a time. On the other hand, calling `ugds.sum()` executes the internal Pandas `sum` method, which runs in compiled C code and is much faster than iterating with a Python for loop.

There is a lot of overhead in Pandas, which is why the difference is not greater. If we instead create a NumPy array and redo the timings, we can see an enormous difference, with the NumPy array `sum` method outperforming the Python `sum` function by a factor of 200 on an array of 10,000 floats.

In [None]:
len(ugds)

In [None]:
import numpy as np

In [None]:
a = np.random.rand(10**4)

In [None]:
%timeit -n 10 sum(a)

In [None]:
%timeit -n 100 a.sum()
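Outside of a notebook, where the `%timeit` magic is unavailable, a rough equivalent can be sketched with the standard library `timeit` module. The number of repetitions is arbitrary, and the timings will vary by machine, so no particular speedup is assumed here.

```python
import timeit
import numpy as np

a = np.random.rand(10**4)

# Time 10 calls of each approach; timeit returns the total seconds elapsed
builtin_time = timeit.timeit(lambda: sum(a), number=10)
numpy_time = timeit.timeit(lambda: a.sum(), number=10)

print(f"built-in sum: {builtin_time:.6f} s, ndarray.sum: {numpy_time:.6f} s")
```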

## No Performance difference for `abs`
Notice that there is no performance difference when calling the `abs` function versus the `abs` Pandas method. This is because the exact same underlying code is being called. This is due to how Python designed the `abs` function: it lets developers provide a custom `__abs__` special method that is executed whenever the `abs` function is called on their object. Pandas implements this hook, so both calls are literally the same.
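The hook can be illustrated with a small sketch. The `Temperature` class below is hypothetical and exists only to show that the built-in `abs` function defers to an object's `__abs__` special method.

```python
import pandas as pd

class Temperature:
    """Hypothetical class used only to illustrate the __abs__ hook."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __abs__(self):
        # Called by Python whenever abs() receives a Temperature
        return Temperature(abs(self.degrees))

print(abs(Temperature(-5)).degrees)

# Pandas objects take part in the same protocol
s = pd.Series([-1.5, 2.0, -3.0])
print(abs(s).equals(s.abs()))
```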

**GUIDANCE** - Use the Pandas method over any built-in Python function with the same name.