In [1]:
import pandas
print('pandas',pandas.__version__)

pandas 0.23.4


In [2]:
df = pandas.read_csv("https://people.sc.fsu.edu/~jburkardt/data/csv/mlb_players.csv", 
                     skiprows=[1035],
                     skipinitialspace=True)
df.head()

Unnamed: 0,Name,Team,Position,Height(inches),Weight(lbs),Age
0,Adam Donachie,BAL,Catcher,74,180.0,22.99
1,Paul Bako,BAL,Catcher,74,215.0,34.69
2,Ramon Hernandez,BAL,Catcher,72,210.0,30.78
3,Kevin Millar,BAL,First Baseman,72,210.0,35.43
4,Chris Gomez,BAL,First Baseman,73,188.0,35.71


Here's a loop that we may use often to explore a dataframe:

In [3]:
for this_column in df.columns:
    print("==== ",this_column,"has",df[this_column].nunique(),"unique entries ====")
    print(df[this_column].value_counts().head(10))

====  Name has 1032 unique entries ====
Chris Young        2
Tony Pe?a          2
David Shafer       1
Scott Kazmir       1
David Ortiz        1
Victor Martinez    1
Jeff DaVanon       1
Nick Johnson       1
Josh Shortslef     1
Scott Podsednik    1
Name: Name, dtype: int64
====  Team has 30 unique entries ====
NYM    38
ATL    37
OAK    37
DET    37
BOS    36
WAS    36
PHI    36
CHC    36
CIN    36
BAL    35
Name: Team, dtype: int64
====  Position has 9 unique entries ====
Relief Pitcher       315
Starting Pitcher     221
Outfielder           194
Catcher               76
Second Baseman        58
First Baseman         55
Shortstop             52
Third Baseman         45
Designated Hitter     18
Name: Position, dtype: int64
====  Height(inches) has 17 unique entries ====
74    175
73    167
75    160
72    152
76    103
71     89
77     57
70     52
78     27
69     19
Name: Height(inches), dtype: int64
====  Weight(lbs) has 89 unique entries ====
200.0    108
190.0     97
180.0     81


Rather than copy-pasting it from notebook to notebook, place the code inside a function in a .py file.

To show the contents of the .py file, I'll use the `cat` command:

In [3]:
!cat myfunctions.py

def unique_entries_in_frame(df,count):
    for this_column in df.columns:
        print("==== ",this_column,"has",df[this_column].nunique(),"unique entries ====")
        print(df[this_column].value_counts().head(count))
    return

Load this function using the `%run` line magic:

https://ipython.readthedocs.io/en/stable/interactive/magics.html

In [4]:
%run myfunctions.py
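
Roughly speaking, `%run` executes the file as a script and then makes the names it defined available in the notebook. The sketch below imitates that idea in plain Python; it is a simplification and not how IPython implements the magic. The temporary directory and the `namespace` dictionary are assumptions made for illustration, and the file content is a stand-in copy of myfunctions.py.

```python
import tempfile
from pathlib import Path

# Stand-in for myfunctions.py, written to a temporary directory for this sketch
SOURCE = '''\
def unique_entries_in_frame(df, count):
    for this_column in df.columns:
        print("==== ", this_column, "has", df[this_column].nunique(), "unique entries ====")
        print(df[this_column].value_counts().head(count))
'''

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "myfunctions.py"
    script.write_text(SOURCE)

    # Execute the file's code in a fresh namespace, the way a script would run
    namespace = {"__name__": "__main__"}
    exec(compile(script.read_text(), str(script), "exec"), namespace)

# Names defined by the file are now available to the caller
unique_entries_in_frame = namespace["unique_entries_in_frame"]
print(unique_entries_in_frame.__name__)
```

Because the file is executed each time, running `%run myfunctions.py` again after editing the file redefines the function with the new code.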

To use the function, I need to know its name and its arguments:

In [5]:
unique_entries_in_frame(df,5)

====  Name has 1032 unique entries ====
Chris Young      2
Tony Pe?a        2
Mark DeRosa      1
Merkin Valdez    1
Sean Tracey      1
Name: Name, dtype: int64
====  Team has 30 unique entries ====
NYM    38
ATL    37
DET    37
OAK    37
CIN    36
Name: Team, dtype: int64
====  Position has 9 unique entries ====
Relief Pitcher      315
Starting Pitcher    221
Outfielder          194
Catcher              76
Second Baseman       58
Name: Position, dtype: int64
====  Height(inches) has 17 unique entries ====
74    175
73    167
75    160
72    152
76    103
Name: Height(inches), dtype: int64
====  Weight(lbs) has 89 unique entries ====
200.0    108
190.0     97
180.0     81
210.0     72
220.0     72
Name: Weight(lbs), dtype: int64
====  Age has 725 unique entries ====
24.94    7
27.12    6
31.28    5
24.63    5
29.95    4
Name: Age, dtype: int64
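
If the name or the arguments of a loaded function are not at hand, they can be looked up from Python itself. The sketch below uses the standard library's `inspect` module; the function body is repeated here only so the example runs on its own.

```python
import inspect

# Copy of the function from myfunctions.py, repeated so this sketch is self-contained
def unique_entries_in_frame(df, count):
    for this_column in df.columns:
        print("==== ", this_column, "has", df[this_column].nunique(), "unique entries ====")
        print(df[this_column].value_counts().head(count))

# inspect.signature reports the parameters a function accepts
signature = inspect.signature(unique_entries_in_frame)
print(unique_entries_in_frame.__name__, signature)

# The parameter names alone, in declaration order
parameter_names = list(signature.parameters)
print(parameter_names)
```

Inside a notebook, typing `unique_entries_in_frame?` in a cell shows similar information.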


As an alternative to `cat`, `%load` copies the file's contents into a notebook cell:

https://stackoverflow.com/questions/21034373/how-to-load-edit-run-save-text-files-py-into-an-ipython-notebook-cell

In [6]:
# %load myfunctions.py
def unique_entries_in_frame(df,count):
    for this_column in df.columns:
        print("==== ",this_column,"has",df[this_column].nunique(),"unique entries ====")
        print(df[this_column].value_counts().head(count))
    return

The advantage of `%load` is that the source code is part of the notebook, so we don't need to store the .py source with the .ipynb notebook.

That is also a disadvantage: if we update the .py source, the notebook keeps running the copy stored in the cell and does not pick up the change.
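
The trade-off can be reproduced outside the notebook. In the sketch below, a hypothetical `greeting` function and a temporary file stand in for the real myfunctions.py: reading the file once mimics the copy that `%load` leaves in a cell, and executing the file again mimics rerunning `%run`.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "myfunctions.py"
    script.write_text("def greeting():\n    return 'version 1'\n")

    # %load-style: the source text is copied once, as it would be into a cell
    loaded_snapshot = script.read_text()

    # Later, the .py file is edited
    script.write_text("def greeting():\n    return 'version 2'\n")

    # The copied text still holds the old code
    snapshot_namespace = {}
    exec(loaded_snapshot, snapshot_namespace)
    from_snapshot = snapshot_namespace["greeting"]()

    # %run-style: executing the file again picks up the edit
    rerun_namespace = {}
    exec(script.read_text(), rerun_namespace)
    from_rerun = rerun_namespace["greeting"]()

print(from_snapshot)  # version 1
print(from_rerun)     # version 2
```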