* [.apply() method](#apply_method)
* [.apply() method with multiple columns](#apply_with_multiple)
* [np.vectorize()](#vectorization)
* [timeit module](#timeit)
* [.describe() method](#describe)
* [.transpose() method](#transpose)
* [.sort_values() method](#sort)
* [.max() method, .idxmax() method](#max)
* [.min() method, .idxmin() method](#min)
* [.corr() method](#corr)
* [.value_counts() method](#value_count)
* [.unique() method, .nunique() method](#unique)
* [.replace() method](#replace)
* [.map() method](#map)
* [.duplicated() method, .drop_duplicates() method](#dupl) 
* [.between() method](#between)
* [.nlargest() method, .nsmallest() method](#large) 
* [.sample() method](#sample)

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('tips.csv')

In [3]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251


<a id='apply_method'></a>

__`.apply() method`__

`.apply()` lets you write a custom function and apply it to every value in a pandas Series.

Example: grabbing the last four digits of the 'CC Number' (credit card number) column.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   total_bill        244 non-null    float64
 1   tip               244 non-null    float64
 2   sex               244 non-null    object 
 3   smoker            244 non-null    object 
 4   day               244 non-null    object 
 5   time              244 non-null    object 
 6   size              244 non-null    int64  
 7   price_per_person  244 non-null    float64
 8   Payer Name        244 non-null    object 
 9   CC Number         244 non-null    int64  
 10  Payment ID        244 non-null    object 
dtypes: float64(3), int64(2), object(6)
memory usage: 21.1+ KB


We can see that the 'CC Number' column has an integer dtype (int64).

In [9]:
def last_four(num):
    return int(str(num)[-4:])

In [10]:
last_four(df['CC Number'][0])

3410

We want to apply the 'last_four' function to every row of the 'CC Number' column.

In [11]:
df['CC Number'].apply(last_four)

0      3410
1      9230
2      1322
3      5994
4      7221
       ... 
239    2842
240    5404
241    7196
242     950
243    8139
Name: CC Number, Length: 244, dtype: int64

### Important

__Note that we pass the function object itself (`last_four`, without parentheses) to `.apply()`. We do not call it ourselves; pandas calls it once for each value.__

In [12]:
df['last_four'] = df['CC Number'].apply(last_four)
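
One caveat is visible in the output above: row 242 shows 950 rather than 0950, because `int()` drops the leading zero. If the digits are meant as an identifier rather than a number, a minimal sketch that keeps them as text could look like the following. The helper name `last_four_str` and the two sample card numbers (taken from rows 0 and 242) are only for illustration.

```python
import pandas as pd

# Two card numbers from the dataset: rows 0 and 242
cards = pd.Series([3560325168603410, 4375220550950], name='CC Number')


def last_four_str(num):
    """Return the last four digits as a string, so leading zeros survive."""
    return str(num)[-4:]


# Option 1: the same .apply() pattern, but returning text
as_text = cards.apply(last_four_str)

# Option 2: pandas string methods, no custom function needed
as_text_vectorized = cards.astype(str).str[-4:]

print(as_text.tolist())
```

Whether text or integers are the better choice depends on how the column is used later; for display or matching purposes, text is usually the safer assumption.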

___

In [14]:
df['total_bill'].mean()

19.785942622950824

In [15]:
def yelp(price):
    if price < 10:
        return '$'
    elif 10 <= price < 30:
        return '$$'
    else:
        return '$$$'

In [16]:
df['yelp'] = df['total_bill'].apply(yelp)
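
As an aside, the same kind of binning can be sketched without a custom function by using `pd.cut`. The bin edges below mirror the thresholds in `yelp`; the sample bills are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical bills chosen to sit on and around the thresholds
bills = pd.Series([5.00, 10.00, 29.99, 30.00], name='total_bill')

# right=False makes each bin closed on the left:
# [-inf, 10) -> '$', [10, 30) -> '$$', [30, inf) -> '$$$'
yelp_binned = pd.cut(
    bills,
    bins=[-np.inf, 10, 30, np.inf],
    labels=['$', '$$', '$$$'],
    right=False,
)

print(yelp_binned.astype(str).tolist())
```

Note that `pd.cut` returns a categorical Series rather than plain strings, which may or may not be what you want downstream.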

In [18]:
df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,yelp
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458,1322,$$
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994,$$
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221,$$
...,...,...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657,2842,$$
240,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766,5404,$$
241,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880,7196,$$
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17,950,$$


___

__What if I have a function that is dependent on two columns?__

<a id='apply_with_multiple'></a>

__`.apply() method with multiple columns`__

In [19]:
def simple(num):
    return num * 2

Converting the function into a single-use __lambda expression__.

A lambda expression is essentially an anonymous function: we don't bind it to a name the way `def` binds a name to a regular function.

In [20]:
lambda num: num * 2

<function __main__.<lambda>(num)>

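__Keep in mind__, NOT everything can be converted to a lambda expression: the body of a lambda must be a single expression, so it cannot contain statements such as loops or multi-line if/else blocks.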
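
To illustrate where the boundary lies, here is a small sketch. A function built from if/elif statements can sometimes still be squeezed into a lambda with conditional expressions, but readability suffers quickly. The name `yelp_lambda` is only for illustration.

```python
def yelp(price):
    if price < 10:
        return '$'
    elif price < 30:
        return '$$'
    else:
        return '$$$'


# The same logic as one expression, using nested conditional expressions.
# It works, but it is harder to read than the def version.
yelp_lambda = lambda price: '$' if price < 10 else ('$$' if price < 30 else '$$$')

for price in (5, 10, 29.99, 30):
    print(price, yelp(price), yelp_lambda(price))
```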

In [21]:
df['total_bill'].apply(simple)

0      33.98
1      20.68
2      42.02
3      47.36
4      49.18
       ...  
239    58.06
240    54.36
241    45.34
242    35.64
243    37.56
Name: total_bill, Length: 244, dtype: float64

Or, alternatively, with a lambda expression:

In [22]:
df['total_bill'].apply(lambda num: num * 2)

0      33.98
1      20.68
2      42.02
3      47.36
4      49.18
       ...  
239    58.06
240    54.36
241    45.34
242    35.64
243    37.56
Name: total_bill, Length: 244, dtype: float64

___

In [23]:
def quality(total_bill, tip):
    if tip / total_bill > 0.25:
        return "Generous"
    else:
        return "Other"

In [24]:
df[['total_bill', 'tip']].apply(lambda row: quality(row['total_bill'], row['tip']), axis=1)

0      Other
1      Other
2      Other
3      Other
4      Other
       ...  
239    Other
240    Other
241    Other
242    Other
243    Other
Length: 244, dtype: object

____

- Select the columns your function needs.
- Call .apply() on that selection.
- Pass a lambda to .apply(); with axis=1 it receives one row at a time (as a Series), not the whole dataframe.
- Inside the lambda, pass that row's column values into your custom function.
- Specify axis=1 so the function is applied row by row; the default, axis=0, applies it column by column.
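
The steps above can be sketched on a tiny, made-up frame (the values in `toy` are not from tips.csv):

```python
import pandas as pd


def quality(total_bill, tip):
    if tip / total_bill > 0.25:
        return "Generous"
    else:
        return "Other"


# Hypothetical data: tip ratios of 0.30, 0.10 and exactly 0.25
toy = pd.DataFrame({'total_bill': [10.0, 20.0, 8.0],
                    'tip': [3.0, 2.0, 2.0],
                    'day': ['Sun', 'Sat', 'Fri']})

labels = toy[['total_bill', 'tip']].apply(      # 1. select columns, 2. call .apply()
    lambda row: quality(row['total_bill'],      # 3. the lambda receives one row
                        row['tip']),            # 4. pass the row's values on
    axis=1,                                     # 5. apply row by row
)

print(labels.tolist())
```

The third row has a ratio of exactly 0.25, which is not strictly greater than the threshold, so it is labelled "Other".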



___

<a id=vectorization></a>

__`np.vectorize()`__

Now I want to show a way to make this __run a lot faster__: wrapping the function with __np.vectorize()__ and calling the result on the columns directly.

In [25]:
df['Quality'] = np.vectorize(quality)(df['total_bill'], df['tip'])

In [26]:
df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,yelp,Quality
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$,Other
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$,Other
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458,1322,$$,Other
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994,$$,Other
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221,$$,Other
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657,2842,$$,Other
240,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766,5404,$$,Other
241,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880,7196,$$,Other
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17,950,$$,Other


__Vectorization__ is worth keeping in mind whenever we apply a function that was written for single values and is NOT aware that it will be applied to a NumPy array.

The purpose of np.vectorize() is to wrap such a function so that it accepts arrays and broadcasts over their elements.

Our 'quality' function accepts plain integers and floating point numbers; it knows nothing about arrays.
<br>Every column in a dataframe is a pandas Series, which in turn holds a NumPy array, so the wrapped function can take whole columns as arguments.

Note that the NumPy documentation describes np.vectorize() as a convenience whose implementation is essentially a for loop. The speed-up over `.apply(..., axis=1)` comes mostly from not having to build a Series object for every row, rather than from true array-level computation.
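
For comparison, here is a sketch of the same labelling done with array-level operations only, using `np.where`. This is not part of the original notebook; the frame `toy` is made up, and for simple conditions like this one an array expression is often faster still than `np.vectorize()`.

```python
import numpy as np
import pandas as pd


def quality(total_bill, tip):
    if tip / total_bill > 0.25:
        return "Generous"
    else:
        return "Other"


# Hypothetical data, not from tips.csv
toy = pd.DataFrame({'total_bill': [10.0, 20.0, 8.0],
                    'tip': [3.0, 2.0, 2.0]})

# np.vectorize: still calls quality() once per element under the hood
looped = np.vectorize(quality)(toy['total_bill'], toy['tip'])

# np.where: the division and comparison run on whole arrays at once
array_level = np.where(toy['tip'] / toy['total_bill'] > 0.25, 'Generous', 'Other')

print(looped.tolist())
print(array_level.tolist())
```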

___

<a id=timeit></a>

__`timeit module`__


`timeit.timeit()` takes two pieces of code as strings: setup code, which runs once, and the statement to be timed.

In [27]:
import timeit

In [28]:
# code snippet to be executed only once
setup = """
import numpy as np
import pandas as pd
df = pd.read_csv('tips.csv')
def quality(total_bill, tip):
    if tip / total_bill > 0.25:
        return 'Generous'
    else:
        return 'Other'
"""

In [29]:
# code snippet whose execution time is to be measured
stmt_one = '''
df['tip_quality'] = df[['total_bill', 'tip']].apply(lambda df: quality(df['total_bill'], df['tip']), axis=1)
'''

stmt_two = '''
df['tip_quality'] = np.vectorize(quality)(df['total_bill'], df['tip'])
'''

timeit runs the setup string once, then executes the statement `number` times and reports the total elapsed time in seconds.

In [31]:
timeit.timeit(setup=setup, stmt=stmt_one, number=1000)

# number is how many times the statement is executed; here it runs one thousand times

8.538050700000895

In [35]:
timeit.timeit(stmt=stmt_two, setup=setup, number=1000)

0.7156521000033536
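
In the run above, 1000 executions took about 8.54 s with `.apply()` and about 0.72 s with `np.vectorize()`, roughly a 12-fold difference on that machine. `timeit.timeit()` also accepts callables instead of code strings, which avoids the multiline strings. The sketch below assumes a small synthetic frame (`toy`) instead of tips.csv, and the helper names `with_apply` and `with_vectorize` are made up; absolute timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd


def quality(total_bill, tip):
    if tip / total_bill > 0.25:
        return 'Generous'
    else:
        return 'Other'


# Synthetic stand-in for the tips data
toy = pd.DataFrame({'total_bill': np.linspace(5, 50, 200),
                    'tip': np.linspace(1, 10, 200)})


def with_apply():
    return toy[['total_bill', 'tip']].apply(
        lambda row: quality(row['total_bill'], row['tip']), axis=1)


def with_vectorize():
    return np.vectorize(quality)(toy['total_bill'], toy['tip'])


# Each call returns the total time in seconds for `number` executions
apply_seconds = timeit.timeit(with_apply, number=20)
vectorize_seconds = timeit.timeit(with_vectorize, number=20)

print(f'apply: {apply_seconds:.4f} s, vectorize: {vectorize_seconds:.4f} s')
```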

___

### Describing and sorting

In [36]:
df = pd.read_csv('tips.csv')

<a id='describe'></a>

__`.describe() method`__

In [37]:
df.describe()

Unnamed: 0,total_bill,tip,size,price_per_person,CC Number
count,244.0,244.0,244.0,244.0,244.0
mean,19.785943,2.998279,2.569672,7.888197,2563496000000000.0
std,8.902412,1.383638,0.9511,2.914234,2369340000000000.0
min,3.07,1.0,1.0,2.88,60406790000.0
25%,13.3475,2.0,2.0,5.8,30407310000000.0
50%,17.795,2.9,2.0,7.255,3525318000000000.0
75%,24.1275,3.5625,3.0,9.39,4553675000000000.0
max,50.81,10.0,6.0,20.27,6596454000000000.0


Sometimes these statistics don't make sense for every column. For example, the standard deviation of the credit card number column is meaningless, because those numbers are identifiers rather than quantities.
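
One way to keep such columns out of the summary is to drop or select columns before calling `.describe()`. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical data, not from tips.csv
toy = pd.DataFrame({'total_bill': [10.0, 20.0, 30.0],
                    'tip': [1.0, 2.0, 3.0],
                    'CC Number': [111, 222, 333],
                    'day': ['Sun', 'Sat', 'Sat']})

# Leave the identifier column out of the numeric summary
numeric_summary = toy.drop(columns='CC Number').describe()

# On text-only data, describe() reports count, unique, top and freq instead
text_summary = toy[['day']].describe()

print(numeric_summary)
print(text_summary)
```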

<a id='transpose'></a>

__`.transpose() method`__

In [38]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
total_bill,244.0,19.78594,8.902412,3.07,13.3475,17.795,24.1275,50.81
tip,244.0,2.998279,1.383638,1.0,2.0,2.9,3.5625,10.0
size,244.0,2.569672,0.9510998,1.0,2.0,2.0,3.0,6.0
price_per_person,244.0,7.888197,2.914234,2.88,5.8,7.255,9.39,20.27
CC Number,244.0,2563496000000000.0,2369340000000000.0,60406790000.0,30407310000000.0,3525318000000000.0,4553675000000000.0,6596454000000000.0


___

<a id='sort'></a>

__`.sort_values() method`__

In [39]:
df.sort_values('tip')

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
67,3.07,1.00,Female,Yes,Sat,Dinner,1,3.07,Tiffany Brock,4359488526995267,Sat3455
236,12.60,1.00,Male,Yes,Sat,Dinner,2,6.30,Matthew Myers,3543676378973965,Sat5032
92,5.75,1.00,Female,Yes,Fri,Dinner,2,2.88,Leah Ramirez,3508911676966392,Fri3780
111,7.25,1.00,Female,No,Sat,Dinner,1,7.25,Terri Jones,3559221007826887,Sat4801
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
...,...,...,...,...,...,...,...,...,...,...,...
141,34.30,6.70,Male,No,Thur,Lunch,6,5.72,Steven Carlson,3526515703718508,Thur1025
59,48.27,6.73,Male,No,Sat,Dinner,4,12.07,Brian Ortiz,6596453823950595,Sat8139
23,39.42,7.58,Male,No,Sat,Dinner,4,9.86,Lance Peterson,3542584061609808,Sat239
212,48.33,9.00,Male,No,Sat,Dinner,4,12.08,Alex Williamson,676218815212,Sat4590


In [41]:
df.sort_values('tip', ascending=False)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
170,50.81,10.00,Male,Yes,Sat,Dinner,3,16.94,Gregory Clark,5473850968388236,Sat1954
212,48.33,9.00,Male,No,Sat,Dinner,4,12.08,Alex Williamson,676218815212,Sat4590
23,39.42,7.58,Male,No,Sat,Dinner,4,9.86,Lance Peterson,3542584061609808,Sat239
59,48.27,6.73,Male,No,Sat,Dinner,4,12.07,Brian Ortiz,6596453823950595,Sat8139
141,34.30,6.70,Male,No,Thur,Lunch,6,5.72,Steven Carlson,3526515703718508,Thur1025
...,...,...,...,...,...,...,...,...,...,...,...
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
236,12.60,1.00,Male,Yes,Sat,Dinner,2,6.30,Matthew Myers,3543676378973965,Sat5032
111,7.25,1.00,Female,No,Sat,Dinner,1,7.25,Terri Jones,3559221007826887,Sat4801
67,3.07,1.00,Female,Yes,Sat,Dinner,1,3.07,Tiffany Brock,4359488526995267,Sat3455


You can also __sort by more than one column__ by passing the columns in as a list.

The first few tips are all the same (one dollar each), so we can sort first by the tip column and then break ties with another column.

In [43]:
df.sort_values(['tip', 'size'])

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
67,3.07,1.00,Female,Yes,Sat,Dinner,1,3.07,Tiffany Brock,4359488526995267,Sat3455
111,7.25,1.00,Female,No,Sat,Dinner,1,7.25,Terri Jones,3559221007826887,Sat4801
92,5.75,1.00,Female,Yes,Fri,Dinner,2,2.88,Leah Ramirez,3508911676966392,Fri3780
236,12.60,1.00,Male,Yes,Sat,Dinner,2,6.30,Matthew Myers,3543676378973965,Sat5032
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
...,...,...,...,...,...,...,...,...,...,...,...
141,34.30,6.70,Male,No,Thur,Lunch,6,5.72,Steven Carlson,3526515703718508,Thur1025
59,48.27,6.73,Male,No,Sat,Dinner,4,12.07,Brian Ortiz,6596453823950595,Sat8139
23,39.42,7.58,Male,No,Sat,Dinner,4,9.86,Lance Peterson,3542584061609808,Sat239
212,48.33,9.00,Male,No,Sat,Dinner,4,12.08,Alex Williamson,676218815212,Sat4590


So it sorts by tip and then, among rows with the same tip, by size.
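
The sort direction can also be set per column by passing a list to `ascending`. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical data with string labels so the resulting order is easy to follow
toy = pd.DataFrame({'tip': [1.0, 1.0, 2.0, 1.0],
                    'size': [2, 1, 4, 3]},
                   index=['a', 'b', 'c', 'd'])

# tip ascending, and within equal tips, size descending
ordered = toy.sort_values(['tip', 'size'], ascending=[True, False])

print(ordered)
```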

___

__Grabbing the index locations of the min and max values.__

<a id='max'></a>

__`.max() method, .idxmax() method`__

In [44]:
df['total_bill'].max()

50.81

In [45]:
df['total_bill'].idxmax()

170

In [46]:
df.loc[170]  # idxmax() returns an index label, so .loc is the matching accessor

total_bill                     50.81
tip                             10.0
sex                             Male
smoker                           Yes
day                              Sat
time                          Dinner
size                               3
price_per_person               16.94
Payer Name             Gregory Clark
CC Number           5473850968388236
Payment ID                   Sat1954
Name: 170, dtype: object

___

<a id='min'></a>

__`.min() method, .idxmin() method`__

In [51]:
df['total_bill'].min()

3.07

In [52]:
df.loc[df['total_bill'].idxmin()]

total_bill                      3.07
tip                              1.0
sex                           Female
smoker                           Yes
day                              Sat
time                          Dinner
size                               1
price_per_person                3.07
Payer Name             Tiffany Brock
CC Number           4359488526995267
Payment ID                   Sat3455
Name: 67, dtype: object
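
A detail worth knowing: `.idxmax()` and `.idxmin()` return index labels, not integer positions. With the default RangeIndex the two coincide, which is why label-based and position-based lookups give the same row in this dataset. The sketch below uses a made-up frame with string labels to show the difference.

```python
import pandas as pd

# Hypothetical frame with a non-default index
toy = pd.DataFrame({'total_bill': [16.99, 50.81, 3.07]},
                   index=['x', 'y', 'z'])

top_label = toy['total_bill'].idxmax()      # an index label
top_row = toy.loc[top_label]                # label-based lookup

# The positional counterpart: argmax on the underlying array, then .iloc
top_position = toy['total_bill'].to_numpy().argmax()
same_row = toy.iloc[top_position]

print(top_label, top_position)
```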

___

<a id='corr'></a>

__`.corr() method`__

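A quick correlation check.

In [54]:
# correlation is only defined for numeric columns; from pandas 2.0 on, the text columns
# must be excluded explicitly, e.g. with numeric_only=True (available since pandas 1.5)

df.corr(numeric_only=True)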

Unnamed: 0,total_bill,tip,size,price_per_person,CC Number
total_bill,1.0,0.675734,0.598315,0.647554,0.104576
tip,0.675734,1.0,0.489299,0.347405,0.110857
size,0.598315,0.489299,1.0,-0.175359,-0.030239
price_per_person,0.647554,0.347405,-0.175359,1.0,0.13524
CC Number,0.104576,0.110857,-0.030239,0.13524,1.0
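
When a frame mixes numeric and text columns, one version-independent option is to select the numeric columns explicitly before calling `.corr()`. A single pair of columns can also be correlated directly. The frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical data, not from tips.csv
toy = pd.DataFrame({'total_bill': [10.0, 20.0, 30.0, 40.0],
                    'tip': [1.0, 2.5, 2.0, 5.0],
                    'day': ['Sun', 'Sat', 'Sat', 'Fri']})

# Full correlation matrix over the numeric columns only
numeric_corr = toy.select_dtypes(include='number').corr()

# Correlation between just two columns (Pearson by default)
pair_corr = toy['total_bill'].corr(toy['tip'])

print(numeric_corr)
print(pair_corr)
```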


___

<a id='value_count'></a>

__`.value_counts() method`__

__Getting a count per category__ (this mostly makes sense for a categorical column)

In [55]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251


In [56]:
df['sex'].value_counts()

Male      157
Female     87
Name: sex, dtype: int64

<a id='unique'></a>

__`.unique() method, .nunique() method`__

In [57]:
df['day'].unique()

array(['Sun', 'Sat', 'Thur', 'Fri'], dtype=object)

In [60]:
# the number of unique values

df['day'].nunique()

4

In [62]:
df['day'].value_counts()

Sat     87
Sun     76
Thur    62
Fri     19
Name: day, dtype: int64
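
If proportions are more useful than raw counts, `.value_counts()` accepts `normalize=True`. A short sketch on made-up data:

```python
import pandas as pd

# Hypothetical data, not from tips.csv
days = pd.Series(['Sat', 'Sat', 'Sun', 'Fri'], name='day')

counts = days.value_counts()                  # raw counts per category
shares = days.value_counts(normalize=True)    # fractions that sum to 1

print(counts)
print(shares)
```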

___

<a id='replace'></a>

__`.replace() method`__

__.replace()__ and __.map()__ overlap in what they do.

.replace() takes a single value, a list of values or a dictionary; .map() is typically given a dictionary (or a function).

Let's imagine, I want to replace 'Female' and 'Male' with just the letters 'F' and 'M'.

__replacing a single value__

In [63]:
df['sex'].replace('Female', 'F')

0         F
1      Male
2      Male
3      Male
4         F
       ... 
239    Male
240       F
241    Male
242    Male
243       F
Name: sex, Length: 244, dtype: object

It can also take a list of values to replace together with a matching list of replacements.

In [64]:
df['sex'].replace(['Female', 'Male'], ['F', 'M'])

0      F
1      M
2      M
3      M
4      F
      ..
239    M
240    F
241    M
242    M
243    F
Name: sex, Length: 244, dtype: object

___

<a id='map'></a>

__`.map() method`__

The first step in using .map() is to create the mapping as a dictionary.

In [65]:
mymap = {'Female': 'F', 
         'Male': 'M'}

In [66]:
df['sex'].map(mymap)

0      F
1      M
2      M
3      M
4      F
      ..
239    M
240    F
241    M
242    M
243    F
Name: sex, Length: 244, dtype: object

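__.replace()__ is easier when there are only a __few items__ to replace.

__.map()__ is easier when there are __lots of items__ to replace. Keep in mind that .map() returns NaN for any value that is missing from the dictionary, whereas .replace() leaves such values unchanged.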
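
The two methods treat values that are absent from the mapping differently, which a small sketch makes visible. The value 'Other' below is made up and does not occur in tips.csv.

```python
import pandas as pd

sexes = pd.Series(['Female', 'Male', 'Other'])
mymap = {'Female': 'F', 'Male': 'M'}

mapped = sexes.map(mymap)          # unmapped values become NaN
replaced = sexes.replace(mymap)    # unmapped values are left as they are

print(mapped.tolist())
print(replaced.tolist())
```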

___

#### Duplicated rows

In [76]:
simple_df = pd.DataFrame([1, 2, 2], index=['a', 'b', 'c'])

In [77]:
simple_df

Unnamed: 0,0
a,1
b,2
c,2


<a id='dupl'></a>

__`.duplicated() method, .drop_duplicates() method`__

In [78]:
simple_df.duplicated()

a    False
b    False
c     True
dtype: bool

In [79]:
simple_df[simple_df.duplicated()]

Unnamed: 0,0
c,2


In [80]:
simple_df.drop_duplicates()

Unnamed: 0,0
a,1
b,2
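
Both methods also accept `subset` (which columns to compare) and `keep` (which copy counts as the original). A sketch on a made-up two-column frame:

```python
import pandas as pd

# Hypothetical data: 'Ann' appears twice, with different cities
toy = pd.DataFrame({'name': ['Ann', 'Bob', 'Ann'],
                    'city': ['Oslo', 'Rome', 'Lima']})

# Compare on 'name' only; by default the first occurrence is not flagged
by_name = toy.duplicated(subset=['name'])

# keep=False flags every member of a duplicated group
all_copies = toy.duplicated(subset=['name'], keep=False)

# keep='last' drops the earlier copies and keeps the last one
last_kept = toy.drop_duplicates(subset=['name'], keep='last')

print(by_name.tolist(), all_copies.tolist())
print(last_kept)
```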


___

<a id='between'></a>

__`.between() method`__

It checks which values in a column lie between two bounds you choose.

In [85]:
df['total_bill'].between(10, 20, inclusive='both')

# inclusive='both' includes both end points (the boolean form inclusive=True is deprecated since pandas 1.3)

  """Entry point for launching an IPython kernel.


0       True
1       True
2      False
3      False
4      False
       ...  
239    False
240    False
241    False
242     True
243     True
Name: total_bill, Length: 244, dtype: bool

In [88]:
df[df['total_bill'].between(10, 10.5, inclusive='both')]

  """Entry point for launching an IPython kernel.


Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
10,10.27,1.71,Male,No,Sun,Dinner,2,5.14,William Riley,566287581219,Sun2546
16,10.33,1.67,Female,No,Sun,Dinner,3,3.44,Elizabeth Foster,4240025044626033,Sun9715
51,10.29,2.6,Female,No,Sun,Dinner,2,5.14,Jessica Ibarra,4999759463713,Sun4474
82,10.07,1.83,Female,No,Thur,Lunch,1,10.07,Julie Moody,630413282843,Thur4909
136,10.33,2.0,Female,No,Thur,Lunch,2,5.16,Donna Kelly,180048553626376,Thur1393
196,10.34,2.0,Male,Yes,Thur,Lunch,2,5.17,Eric Martin,30442491190342,Thur9862
226,10.09,2.0,Female,Yes,Fri,Lunch,2,5.04,Ruth Weiss,5268689490381635,Fri6359
235,10.07,1.25,Male,No,Sat,Dinner,2,5.04,Sean Gonzalez,3534021246117605,Sat4615
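
In pandas 1.3 and later, `inclusive` takes one of the strings 'both', 'neither', 'left' or 'right'. A short sketch of how the end points are treated, on made-up values:

```python
import pandas as pd

# Hypothetical values sitting on and between the bounds
bills = pd.Series([10.0, 15.0, 20.0])

both = bills.between(10, 20, inclusive='both')        # 10 <= x <= 20
neither = bills.between(10, 20, inclusive='neither')  # 10 <  x <  20
left = bills.between(10, 20, inclusive='left')        # 10 <= x <  20

print(both.tolist(), neither.tolist(), left.tolist())
```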


___

<a id='large'></a>

__`.nlargest() method, .nsmallest() method`__

In [89]:
df.nlargest(2, 'tip')

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
170,50.81,10.0,Male,Yes,Sat,Dinner,3,16.94,Gregory Clark,5473850968388236,Sat1954
212,48.33,9.0,Male,No,Sat,Dinner,4,12.08,Alex Williamson,676218815212,Sat4590


In [90]:
# which gives the same rows here as

df.sort_values('tip', ascending=False).iloc[:2]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
170,50.81,10.0,Male,Yes,Sat,Dinner,3,16.94,Gregory Clark,5473850968388236,Sat1954
212,48.33,9.0,Male,No,Sat,Dinner,4,12.08,Alex Williamson,676218815212,Sat4590


In [91]:
df.nsmallest(2, 'tip')

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
67,3.07,1.0,Female,Yes,Sat,Dinner,1,3.07,Tiffany Brock,4359488526995267,Sat3455
92,5.75,1.0,Female,Yes,Fri,Dinner,2,2.88,Leah Ramirez,3508911676966392,Fri3780


In [93]:
df.sort_values('tip').iloc[:2]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
67,3.07,1.0,Female,Yes,Sat,Dinner,1,3.07,Tiffany Brock,4359488526995267,Sat3455
236,12.6,1.0,Male,Yes,Sat,Dinner,2,6.3,Matthew Myers,3543676378973965,Sat5032
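
Notice that the two outputs above disagree on the second row (index 92 versus 236): several rows share the smallest tip, and the two approaches break ties differently. `.nsmallest()` keeps the first occurrences by default, and its `keep` parameter controls this. A sketch on made-up data that reuses a few index labels from the output above:

```python
import pandas as pd

# Hypothetical frame: three rows tie for the smallest tip
toy = pd.DataFrame({'tip': [1.0, 2.0, 1.0, 1.0]},
                   index=[67, 170, 92, 236])

# Default keep='first': among ties, rows that appear first are chosen
first_two = toy.nsmallest(2, 'tip')

# keep='all': return every row tied with the last selected value,
# even if that yields more than n rows
with_ties = toy.nsmallest(2, 'tip', keep='all')

print(first_two.index.tolist())
print(with_ties.index.tolist())
```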


___

#### Taking a random sample of a dataframe

<a id='sample'></a>

__`.sample() method`__

You can either sample a fixed number of rows or a fraction of your dataframe.

In [94]:
df.sample(4)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
29,19.65,3.0,Female,No,Sat,Dinner,2,9.82,Melinda Murphy,5489272944576051,Sat2467
43,9.68,1.32,Male,No,Sun,Dinner,2,4.84,Christopher Spears,4387671121369212,Sun3279
178,9.6,4.0,Female,Yes,Sun,Dinner,2,4.8,Melanie Gray,4211808859168,Sun4598
15,21.58,3.92,Male,No,Sun,Dinner,2,10.79,Matthew Reilly,180073029785069,Sun1878


In [98]:
# this samples 5 percent of the rows of the dataframe

df.sample(frac=0.05) 

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
56,38.01,3.0,Male,Yes,Sat,Dinner,4,9.5,James Christensen DDS,349793629453226,Sat8903
222,8.58,1.92,Male,Yes,Fri,Lunch,1,8.58,Jason Lawrence,3505302934650403,Fri6624
204,20.53,4.0,Male,Yes,Thur,Lunch,4,5.13,Scott Kim,3570611756827620,Thur2160
162,16.21,2.0,Female,No,Sun,Dinner,3,5.4,Jennifer Baird,4227834176859693,Sun5521
70,12.02,1.97,Male,No,Sat,Dinner,2,6.01,Max Brown,213139760497718,Sat2100
238,35.83,4.67,Female,No,Sat,Dinner,3,11.94,Kimberly Crane,676184013727,Sat9777
22,15.77,2.23,Female,No,Sat,Dinner,2,7.88,Ashley Shelton,3524119516293213,Sat9786
28,21.7,4.3,Male,No,Sat,Dinner,2,10.85,David Collier,5529694315416009,Sat3697
214,28.17,6.5,Female,Yes,Sat,Dinner,3,9.39,Marissa Jackson,4922302538691962,Sat3374
155,29.85,5.14,Female,No,Sun,Dinner,5,5.97,Madison Wilson,4210875236164664,Sun9176
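
Because sampling is random, every run returns different rows. If a reproducible sample is needed, `.sample()` accepts a `random_state` seed. The frame below is made up, and the seed values are arbitrary.

```python
import pandas as pd

# Hypothetical frame with 20 rows
toy = pd.DataFrame({'value': range(20)})

# The same seed gives the same rows every time
first = toy.sample(n=4, random_state=42)
second = toy.sample(n=4, random_state=42)

# A fraction works the same way: 25 percent of 20 rows is 5 rows
quarter = toy.sample(frac=0.25, random_state=0)

print(first.index.tolist())
print(len(quarter))
```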


___