# Pandas for Data Analysis: Analyzing Data

## Outline:

* [Knowing Basic Stats](#Knowing-Basic-Stats)
* [Grouping](#Grouping)
* [Creating Pivot Table](#Creating-Pivot-Table)
* [Analyzing Public Datasets](#Analyzing-Public-Datasets)


In [15]:
import pandas as pd

## Knowing Basic Stats

In [None]:
data = {
    'age': [25, 30, 35],
    'savings': [3000, 3100, 1500]
}
df = pd.DataFrame(data=data)

In [None]:
df.describe()

In [None]:
df.cov()

In [None]:
df.corr()

### Challenges

Given a Series of electricity bills for 2015, with the following amount for each month:

* January: 3,000 baht
* February: 3,512 baht
* March: 1,900 baht
* April: 1,988 baht
* May: 3,012 baht
* June: 2,912.35 baht
* July: 3,100 baht
* August: 2,501.02 baht
* September: 3,309 baht
* October: 2,087 baht
* November: 4,223 baht
* December: 3,566 baht

Hint: use the month as the index and the amount as the value at each index.

Try answering the following questions:
1. What is the total electricity bill for the whole year? What is the monthly average?
2. Which month has the highest electricity bill?
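
One possible way to approach this challenge is sketched below. The variable names (`bills`, `total`, `average`, `top_month`) are illustrative choices, not part of the original exercise.

```python
import pandas as pd

# Monthly electricity bills for 2015 (baht), taken from the list above.
bills = pd.Series(
    [3000, 3512, 1900, 1988, 3012, 2912.35,
     3100, 2501.02, 3309, 2087, 4223, 3566],
    index=['January', 'February', 'March', 'April', 'May', 'June',
           'July', 'August', 'September', 'October', 'November', 'December'],
)

total = bills.sum()          # total paid over the year
average = bills.mean()       # average per month
top_month = bills.idxmax()   # index label of the largest bill
```

`idxmax()` returns the index label rather than the value, which is why using the month names as the index pays off here.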

Given a DataFrame of employee salary data:

Person 1

* Name: William
* Job: Chief Investment Officer
* Annual salary: 507,831.60 USD

Person 2

* Name: Ellen
* Job: Asst Med Examiner
* Annual salary: 279,311.10 USD

Person 3

* Name: Barbara
* Job: Dept Head
* Annual salary: 307,580.34 USD

Try answering the following questions:
1. Who has the highest annual income?
2. Who has an annual income below 300,000 USD?
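
A minimal sketch of one way to answer these two questions follows. The column names (`name`, `job`, `annual_pay`) and variable names are assumptions made for illustration only.

```python
import pandas as pd

# Employee data from the challenge; column names are illustrative.
staff = pd.DataFrame({
    'name': ['William', 'Ellen', 'Barbara'],
    'job': ['Chief Investment Officer', 'Asst Med Examiner', 'Dept Head'],
    'annual_pay': [507831.60, 279311.10, 307580.34],
})

# idxmax() returns the row label of the highest annual pay.
top_earner = staff.loc[staff['annual_pay'].idxmax(), 'name']

# A boolean mask keeps only rows below the 300,000 USD threshold.
below_300k = staff.loc[staff['annual_pay'] < 300_000, 'name'].tolist()
```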

---

## Grouping

In [None]:
adult_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
columns = ['age', 'Work Class', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'Money Per Year']
adult = pd.read_csv(adult_data_url, names=columns)

In [None]:
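# numeric_only=True skips text columns, which recent pandas versions refuse to average
adult.groupby('education').agg('mean', numeric_only=True).tail()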

In [None]:
# Same result as above

adult_group = adult.groupby('education')
adult_group.mean(numeric_only=True).tail()

In [None]:
adult.groupby(['education', 'sex']).mean(numeric_only=True).head()

In [None]:
adult.groupby(['education', 'sex']).mean(numeric_only=True).head(30)

In [None]:
adult.columns = adult.columns.str.lower().str.replace(' ', '-')
adult[['capital-gain', 'capital-loss', 'money-per-year']].groupby('money-per-year').mean()
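
The grouping calls above need the Adult file to be downloaded. As a hedged offline sketch, the same pattern can be tried on a tiny made-up frame; the rows below are invented and only borrow a few column names from the dataset.

```python
import pandas as pd

# Made-up rows that mimic a few Adult columns; not real data.
toy = pd.DataFrame({
    'education': ['Bachelors', 'Bachelors', 'HS-grad', 'HS-grad'],
    'sex': ['Male', 'Female', 'Male', 'Male'],
    'age': [30, 40, 20, 50],
    'hours-per-week': [40, 50, 35, 45],
})

# numeric_only=True skips text columns such as 'sex' when averaging.
by_education = toy.groupby('education').mean(numeric_only=True)

# Grouping by two keys yields a MultiIndex of (education, sex).
by_education_sex = toy.groupby(['education', 'sex'])[['age']].mean()
```

Selecting the columns of interest before calling `mean()`, as in the second call, is another way to keep text columns out of the aggregation.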

### Challenges

---

## Creating Pivot Table

http://jalammar.github.io/visualizing-pandas-pivoting-and-reshaping/

Try exploring the Sales Funnel data from [Practical Business Python](https://pbpython.com/)

In [None]:
df = pd.read_excel('data/sales-funnel.xlsx')

In [None]:
df.head()

In [None]:
pd.pivot_table(df, index=['Name'])

In [None]:
pd.pivot_table(df, index=['Name', 'Product'])

In [None]:
pd.pivot_table(df, index=['Name', 'Product'], values=['Quantity'])

In [None]:
import numpy as np

In [None]:
pd.pivot_table(df, index=['Product'], values=['Price'], aggfunc=['sum', 'mean'])

In [None]:
pd.pivot_table(df, index=['Name', 'Product'], columns=['Status'], values=['Price'])

In [None]:
pd.pivot_table(df, index=['Name', 'Product'], columns=['Status'], values=['Price'], fill_value=0)

In [None]:
pd.pivot_table(df, index=['Name', 'Product'], columns=['Status'], values=['Price'], fill_value=0, margins=True)

In [None]:
pd.pivot_table(df, index=['Manager', 'Status'],
               columns=['Product'],
               values=['Quantity', 'Price'],
               aggfunc={'Quantity': len, 'Price': ['sum', 'mean']},
               fill_value=0)
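
The effect of `fill_value` and `margins` is easier to see on a very small table. The sketch below uses invented sales rows (the company names and prices are hypothetical); only the column names follow the Sales Funnel example.

```python
import pandas as pd

# Made-up sales rows; column names follow the Sales Funnel example.
sales = pd.DataFrame({
    'Name': ['Acme', 'Acme', 'Bolt', 'Bolt'],
    'Status': ['won', 'pending', 'won', 'won'],
    'Price': [100, 200, 50, 150],
})

table = pd.pivot_table(
    sales,
    index='Name',
    columns='Status',
    values='Price',
    aggfunc='sum',
    fill_value=0,     # Bolt has no 'pending' deal, so show 0 instead of NaN
    margins=True,     # adds an 'All' row and column with totals
)
```

Without `fill_value=0`, the missing Bolt/pending cell would be `NaN`, which is often harder to read in a summary table.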

### Challenges

---

## Analyzing Public Datasets

### Dataset 1: Adult

Using the [Adult](https://archive.ics.uci.edu/ml/datasets/adult) dataset, try answering the following questions.

In [None]:
adult_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
columns = ['age', 'Work Class', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'Money Per Year']
adult = pd.read_csv(adult_data_url, names=columns)

Which age group has the fewest people in this dataset?

In [None]:
adult.age.value_counts(ascending=True).head()

Which age group has the most people, and how many people are in it?

In [None]:
adult.age.value_counts()

Within the age group found above, how many are male and how many are female?

In [None]:
adult[adult.age == adult.age.value_counts().index[0]]['sex'].value_counts()

### Dataset 2: Amazon Review

Pick a 5-core Amazon review dataset from Julian McAuley's page at http://jmcauley.ucsd.edu/data/amazon/

```
reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
asin - ID of the product, e.g. 0000013714
reviewerName - name of the reviewer
helpful - helpfulness rating of the review, e.g. 2/3
reviewText - text of the review
overall - rating of the product
summary - summary of the review
unixReviewTime - time of the review (unix time)
reviewTime - time of the review (raw)
```

**Note:** This data is for research use only :)

Load the data into a DataFrame

In [2]:
import gzip
import json

import pandas as pd

# Each line of the gzipped file is one JSON-encoded review.
with gzip.open('data/reviews_Digital_Music_5.json.gz', 'rt') as f:
    df = pd.DataFrame([json.loads(line) for line in f])

In [3]:
df.head()

Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,5555991584,"[3, 3]",5.0,"It's hard to believe ""Memory of Trees"" came ou...","09 12, 2006",A3EBHHCZO6V2A4,"Amaranth ""music fan""",Enya's last great album,1158019200
1,5555991584,"[0, 0]",5.0,"A clasically-styled and introverted album, Mem...","06 3, 2001",AZPWAXJG9OJXV,bethtexas,Enya at her most elegant,991526400
2,5555991584,"[2, 2]",5.0,I never thought Enya would reach the sublime h...,"07 14, 2003",A38IRL0X2T4DPF,bob turnley,The best so far,1058140800
3,5555991584,"[1, 1]",5.0,This is the third review of an irish album I w...,"05 3, 2000",A22IK3I6U76GX0,Calle,Ireland produces good music.,957312000
4,5555991584,"[1, 1]",4.0,"Enya, despite being a successful recording art...","01 17, 2008",A1AISPOIIHTHXX,"Cloud ""...""",4.5; music to dream to,1200528000


Store the quarter of each review in a new column named `quarter`

In [4]:
df['unixReviewTime'] = pd.to_datetime(df['unixReviewTime'], unit='s')
df['quarter'] = df.unixReviewTime.dt.quarter
df.head()

Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime,quarter
0,5555991584,"[3, 3]",5.0,"It's hard to believe ""Memory of Trees"" came ou...","09 12, 2006",A3EBHHCZO6V2A4,"Amaranth ""music fan""",Enya's last great album,2006-09-12,3
1,5555991584,"[0, 0]",5.0,"A clasically-styled and introverted album, Mem...","06 3, 2001",AZPWAXJG9OJXV,bethtexas,Enya at her most elegant,2001-06-03,2
2,5555991584,"[2, 2]",5.0,I never thought Enya would reach the sublime h...,"07 14, 2003",A38IRL0X2T4DPF,bob turnley,The best so far,2003-07-14,3
3,5555991584,"[1, 1]",5.0,This is the third review of an irish album I w...,"05 3, 2000",A22IK3I6U76GX0,Calle,Ireland produces good music.,2000-05-03,2
4,5555991584,"[1, 1]",4.0,"Enya, despite being a successful recording art...","01 17, 2008",A1AISPOIIHTHXX,"Cloud ""...""",4.5; music to dream to,2008-01-17,1
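
As a small illustration of the conversion above, the sketch below parses two `unixReviewTime` values copied from the first table and reads the quarter and weekday from them. The variable names are illustrative.

```python
import pandas as pd

# Two unixReviewTime values taken from the rows shown above.
seconds = pd.Series([1158019200, 1058140800])

# unit='s' tells pandas the integers are seconds since 1970-01-01 UTC.
when = pd.to_datetime(seconds, unit='s')

quarters = when.dt.quarter       # 1..4
weekdays = when.dt.dayofweek     # Monday=0 ... Sunday=6
```

Without `unit='s'`, pandas would read the integers as nanoseconds and every date would land in January 1970.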


Select reviews with an overall rating of 5 that were written on a Monday and contain the word love in the review summary

In [5]:
df[(df.overall == 5) & (df.unixReviewTime.dt.dayofweek == 0) & (df.summary.str.contains('love'))]

Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime,quarter
9,5555991584,"[12, 12]",5.0,"Many times, AND WITH GOOD REASON, the ""new age...","05 12, 2003",A33TRNCQK4IUO7,guillermoj,"A true gem, even if you don't love conventiona...",2003-05-12,2
702,B00000053X,"[0, 0]",5.0,I've been a huge fan of the Backstreet Boys si...,"06 4, 2001",A1LDIMQISH07WK,Rebecca,I love this CD!!!,2001-06-04,2
871,B00000064F,"[3, 4]",5.0,Nick Drake's &quot;Bryter Layter&quot; is an a...,"03 4, 2002",A23D8ZZ60GTS3G,Patrice A. Williams,I love Nick Drake's Music it is wonderful in e...,2002-03-04,1
1404,B000000OME,"[15, 15]",5.0,I liked this album from the very first time I ...,"01 27, 2003",AZPWAXJG9OJXV,bethtexas,I love all her different sides!,2003-01-27,1
2646,B00000163G,"[0, 0]",5.0,This album will undoubtedly remain one of the ...,"12 17, 2001",A2EONRA9TLDADW,"""carakay2""","I won't deny it, I love this album",2001-12-17,4
2937,B000001A5X,"[7, 7]",5.0,"Ok This man was so on point, even in the 70's!...","06 12, 2006",AXWHRGDAA8BZD,N. Jackson,I love this Album!,2006-06-12,2
3876,B000001DZO,"[3, 3]",5.0,ABBA. They are pop culture...of the 70s. Gre...,"10 21, 2002",A2J9Q2DYOR81U8,Mr. Wynn,Every song a hit! You'll love 'em!,2002-10-21,4
4678,B000001EW7,"[2, 2]",5.0,I've always liked 10cc and I think this is a v...,"11 21, 2005",ARIGCQMVCXY2E,musicfanatic,I love 10cc!,2005-11-21,4
5470,B000001FOJ,"[1, 1]",5.0,This in my opinion along with Earth Wind & Fir...,"07 23, 2007",A1WHO8TXC8GJHA,"Eric Robinson ""Big E""",A must have for any P-Funk lover.,2007-07-23,3
5762,B000001FZ6,"[101, 113]",5.0,"""Have you ever loved a woman, so much you're t...","08 27, 2001",A3D6TFYRMIV3ZL,Themis-Athena,"Consummate blues, born out of the pain of unfu...",2001-08-27,3
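
The filter above is case-sensitive and would fail on a missing summary. A hedged variation, shown on invented rows that only reuse the dataset's column names, uses the `case` and `na` parameters of `str.contains`:

```python
import pandas as pd

# Made-up reviews; only the column names come from the dataset.
reviews = pd.DataFrame({
    'overall': [5.0, 5.0, 4.0, 5.0],
    'summary': ['I love this CD', 'LOVE it', 'love the bass', None],
})

# case=False also matches 'LOVE'; na=False treats a missing summary as no match.
has_love = reviews['summary'].str.contains('love', case=False, na=False)

# Combine conditions with & and wrap each comparison in parentheses.
picked = reviews[(reviews['overall'] == 5) & has_love]
```

Note that `str.contains` matches substrings, so words such as "lover" also count as a match.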


Find the products that were reviewed on the day with the most reviews

In [17]:
days = df.unixReviewTime.value_counts()
df[df.unixReviewTime == days.index[0]]

Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime,quarter
667,B00000053Q,"[1, 1]",5.0,"Classic cd, represents the south good, with so...","03 4, 2005",A2BSCF3SOLIYSN,"Lil Flit ""Flit""",Free pimp c,2005-03-04,1
2023,B000000WGF,"[1, 19]",1.0,"If Paula Abdul ever records another album, my ...","03 4, 2005",A10ID9PBP48AXN,"Autumn Sadness ""Hello Hello""",NOBODY WANTS ABDUL ON RADIO ANYMORE!,2005-03-04,1
3540,B000001DQG,"[0, 0]",4.0,I share the common belief that Caribou (named ...,"03 4, 2005",A15JWX1RNJ3H6C,"Jonathan M. Goodman ""Blue Suede Schubert""",Mediocre Elton is Still Worth 4 Stars,2005-03-04,1
4605,B000001EW3,"[4, 4]",4.0,I'll keep this short and sweet since everyone ...,"03 4, 2005",A8O13GJJDU01Z,"Christopher ""chrysaetos""",Subtley Remixed,2005-03-04,1
4830,B000001F4X,"[0, 1]",5.0,"Sayreville,New Jersey-based Bon Jovi put out t...","03 4, 2005",A1IKOYZVFHO1XP,andy8047,An awesome New Jersey band!,2005-03-04,1
7324,B00000256E,"[3, 9]",5.0,I RECENTLY SAW NEIL IN CONCERT. I GOT SO EXCI...,"03 4, 2005",AIZ1I64N9CZA1,music lover,HE SENDS ME,2005-03-04,1
7822,B0000025ED,"[2, 2]",5.0,"Billy Joel's fifth album, ""The Stranger"" was a...","03 4, 2005",A2NQUGGYM0DBM1,L.A. Scene,Outstanding Collection - many tracks underrated,2005-03-04,1
9515,B00000273W,"[16, 19]",5.0,For anyone who walked in on Fishbone at the Tr...,"03 4, 2005",A2CHLPRJZSD6U9,"Mr. S. St Thomas ""suckerfly""",The Reality of My Surroundings(1991),2005-03-04,1
10678,B000002AP1,"[2, 4]",4.0,Iggy remastered this baby. Prepare for your sp...,"03 4, 2005",AO932DNYD4IQ6,Jack Knife,This CD actually physically punishes you,2005-03-04,1
11570,B000002GAG,"[0, 0]",5.0,This album propelled Louis Johnson as one of t...,"03 4, 2005",A17RFKCYS69M3Y,Tall Paul,FUNK BLAM!! The Thump Bass Anthem,2005-03-04,1


Find the reviews of the 3 most-reviewed products

In [18]:
products = df.asin.value_counts()[0:3].index
df[df.asin.isin(products)]

Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime,quarter
38362,B00006690F,"[18, 24]",2.0,"At the outset, let me make it clear that I hav...","08 24, 2002",AYQY3RB33F4J6,3rdeadly3rd,The Real Slim Shady Steadfastly Refuses To Sta...,2002-08-24,3
38363,B00006690F,"[2, 4]",4.0,If you want to keep hearing Eminem's hilarious...,"06 15, 2002",A11Z2YS0Q5LKZB,"A being ""Me""",Jimmy Conaway is out of his mind,2002-06-15,2
38364,B00006690F,"[4, 6]",2.0,... Eminem is far from the best rapper in the ...,"03 27, 2003",A1PNKDX2KGXRZ5,A Customer,Eminem = the most overplayed and overrated rap...,2003-03-27,1
38365,B00006690F,"[5, 13]",1.0,EMINEM IS ANOTHER HIGLY OVERATED RAPPER EMINEM...,"07 27, 2003",AAF8RFWLK6AZD,A Customer,THE EMINEM SHOW,2003-07-27,3
38366,B00006690F,"[1, 3]",3.0,It seems like Eminem has just fallen down the ...,"02 23, 2004",A30VQ7PHNAGRML,Adam,hmmmm,2004-02-23,1
38367,B00006690F,"[1, 3]",4.0,"Actually, to me, this album gets 4 1/2 stars. ...","11 15, 2002",A2FL41BCS486E8,Adriana Hernandez,Eminem shows why he's one of the best,2002-11-15,4
38368,B00006690F,"[0, 0]",5.0,I honestly didn't want to buy this CD. My knee...,"08 4, 2002",AZSN1TO0JI87B,A. Estes,Even I'll admit this is good,2002-08-04,3
38369,B00006690F,"[0, 3]",2.0,This has to be one of the worst albums I ever ...,"08 30, 2005",A22N9H8V0RYQR3,A fair and Balanced Rater,Exactly...How did this guy become famous??,2005-08-30,3
38370,B00006690F,"[1, 1]",4.0,Like any rapper who lasts more than 15 minutes...,"07 8, 2002",A2OZBJ58CML9OS,A. Gammill,The Epic Trilogy Concludes...,2002-07-08,3
38371,B00006690F,"[5, 6]",5.0,This is Eminem's third and best album so far. ...,"06 22, 2002",AYB6IIG5BFLH1,AG,Third times the charm...,2002-06-22,2


Find the average rating of each product

In [20]:
df.groupby('asin').overall.mean().sort_values(ascending=False)

asin
B000002652    5.000000
B000002WF8    5.000000
B000001E56    5.000000
B000002KJ4    5.000000
B000001EFF    5.000000
B000002KH1    5.000000
B000002KGE    5.000000
B000002KF2    5.000000
B0000032UL    5.000000
B000001EIU    5.000000
B000001EOG    5.000000
B0026IZR48    5.000000
B00000DS2N    5.000000
B000002KC2    5.000000
B00000335I    5.000000
B000001FCB    5.000000
B000001FCM    5.000000
B000002JKY    5.000000
B00000DFFZ    5.000000
B000001FIH    5.000000
B007V1VS1G    5.000000
B000AO4NKE    5.000000
B000002JBB    5.000000
B000002KLL    5.000000
B00000J2PJ    5.000000
B000078JLP    5.000000
B000001DYH    5.000000
B00009EJC7    5.000000
B000002LMO    5.000000
B001HDYKQ4    5.000000
                ...   
B001XJTB8E    2.600000
B00HFEC192    2.600000
B000000OWW    2.600000
B000F0UV3Q    2.592593
B0007NFL18    2.558824
B001H9N884    2.500000
B0009SCVTG    2.494253
B00004XOWM    2.473214
B0009K7RBG    2.437500
B005RVXYMI    2.400000
B001X3EQLW    2.400000
B001L2BIHK    2.400000
B000HC

Find the average rating given by each user

In [21]:
df.groupby('reviewerID').overall.mean().sort_values(ascending=False)

reviewerID
A2VBZ6DCKU3J73    5.000000
A3MI5H973EAXLC    5.000000
A3VKMUJG8DBODG    5.000000
A3VNV65XYKKY6A    5.000000
A3VWGTIJ2UIXRL    5.000000
A1X4WM1OCKR0WN    5.000000
A3W2UXRJ8SR6RN    5.000000
A3X246310365H     5.000000
A1WWABHEZN2I6N    5.000000
A4622J2IA15MN     5.000000
A1WTYALNQNQG99    5.000000
A1WLZYEOIL1HLT    5.000000
A4F9P9ZRYHLVB     5.000000
A1W3ZAKFIDGM13    5.000000
A1W3FS1C2H4RCV    5.000000
A4PPZNQF1X2IY     5.000000
A1W29LVYUPQNMM    5.000000
A1XJPGV4KPIQWM    5.000000
A1XOGQMVS10UMR    5.000000
A1XTDBWZSMGHUO    5.000000
A3TKDX13QLFEJH    5.000000
A3T4JO848Y08UC    5.000000
A1ZVWTEYIA3GU4    5.000000
A1ZTC02LF6GX1H    5.000000
A3TI1WDAEOTD0I    5.000000
A3TIT4H0ZNOD3O    5.000000
A3TJ0WG5IJ22Y9    5.000000
A1ZF2MSBDUM5SF    5.000000
A1XUD7MX2GJ48Y    5.000000
A1Z8PG2MDCX78W    5.000000
                    ...   
A178ODL6VOBR1X    1.666667
A2872PI8C2CQ7E    1.666667
ANH7ENGKJU9D6     1.666667
A3L61G6N7AT8N2    1.666667
A36489F4G8T4E7    1.657895
A1R74HBHI7PISK   

Find the average rating for each day of the week

In [22]:
df.groupby(df.unixReviewTime.dt.dayofweek).overall.mean()

unixReviewTime
0    4.231810
1    4.197084
2    4.215898
3    4.221097
4    4.232799
5    4.232467
6    4.233472
Name: overall, dtype: float64
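
The numeric keys 0 to 6 stand for Monday to Sunday. If readable labels are preferred, one option is to group by `dt.day_name()` instead. The sketch below uses invented ratings and a hypothetical `when` column.

```python
import pandas as pd

# Made-up ratings on known dates (2003-05-12 was a Monday).
toy = pd.DataFrame({
    'when': pd.to_datetime(['2003-05-12', '2003-05-12', '2003-05-13']),
    'overall': [5.0, 3.0, 4.0],
})

# day_name() returns 'Monday', 'Tuesday', ... instead of 0..6.
by_day = toy.groupby(toy['when'].dt.day_name())['overall'].mean()
```

The trade-off is ordering: day names sort alphabetically, while the numeric keys keep the days in calendar order.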

Find the standard deviation of each product's ratings

In [23]:
df.groupby('asin').overall.std().sort_values(ascending=False)

asin
B0051QK8PU    2.190890
B0000035DC    2.190890
B000002IYI    2.000000
B000VFIDQM    1.995531
B003Y3USS4    1.974842
B004RZTNUM    1.951800
B00FX8FWRU    1.949359
B001B65PBQ    1.949359
B005I0DM52    1.949359
B002I53BL0    1.889822
B0047GL56G    1.870829
B002BPKWH8    1.870829
B00004SSTE    1.864454
B000CSUYFQ    1.864454
B0007LLPF6    1.855921
B00FX8F5HW    1.834848
B00IDXIP1M    1.834848
B004S5JBZ8    1.834848
B003Y3ZTH4    1.816590
B007U732B0    1.816590
B000066SHX    1.816590
B000050HSJ    1.812654
B0013LL04A    1.799471
B00GTZ6O2S    1.788854
B000058E2O    1.788854
B00299CER2    1.788854
B008AV39AO    1.788854
B001D0T4JO    1.788854
B0013FSVD4    1.788854
B000F4TM6Y    1.763603
                ...   
B000002N2B    0.000000
B000002KF2    0.000000
B000002KGE    0.000000
B000002W5F    0.000000
B000002KH1    0.000000
B000006LEV    0.000000
B00009EJC7    0.000000
B000002MKD    0.000000
B00003W1QG    0.000000
B000007NCE    0.000000
B00128X6Z0    0.000000
B000002H7I    0.000000
B00000
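
A standard deviation is easier to interpret next to the review count and the mean. As a sketch on invented ratings (the product IDs `P1` and `P2` are hypothetical), `agg` can compute all three in one pass:

```python
import pandas as pd

# Made-up ratings for two hypothetical product IDs.
toy = pd.DataFrame({
    'asin': ['P1', 'P1', 'P1', 'P2'],
    'overall': [5.0, 3.0, 1.0, 4.0],
})

# One pass computes several statistics per product.
stats = toy.groupby('asin')['overall'].agg(['count', 'mean', 'std'])

# std uses the sample formula (ddof=1), so a single review gives NaN.
single_review_std = stats.loc['P2', 'std']
```

A standard deviation of 0, as at the bottom of the output above, means every review of that product gave the same rating.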

### Dataset 3: Stanford Open Policing

Download "Connecticut" from https://openpolicing.stanford.edu/data/

In [5]:
df = pd.read_csv('data/CT-clean.csv')


In [7]:
df.head()

Unnamed: 0,id,state,stop_date,stop_time,location_raw,county_name,county_fips,fine_grained_location,police_department,driver_gender,...,violation_raw,violation,search_conducted,search_type_raw,search_type,contraband_found,stop_outcome,is_arrested,officer_id,stop_duration
0,CT-2013-00001,CT,2013-10-01,00:01,westport,Fairfield County,9001.0,"00000 N I 95 (WESTPORT, T158) X 18 LL",State Police,F,...,Speed Related,Speeding,False,,,False,Ticket,False,1000002754,1-15 min
1,CT-2013-00002,CT,2013-10-01,00:02,mansfield,Tolland County,9013.0,rte 195 storrs,State Police,M,...,Moving Violation,Moving violation,False,,,False,Verbal Warning,False,1000001903,1-15 min
2,CT-2013-00003,CT,2013-10-01,00:07,franklin,New London County,9011.0,Rt 32/whippoorwill,State Police,M,...,Speed Related,Speeding,False,,,False,Ticket,False,1000002711,1-15 min
3,CT-2013-00004,CT,2013-10-01,00:10,danbury,Fairfield County,9001.0,I-84,State Police,M,...,Speed Related,Speeding,False,,,False,Written Warning,False,113658284,1-15 min
4,CT-2013-00005,CT,2013-10-01,00:10,east hartford,Hartford County,9003.0,"00000 W I 84 (EAST HARTFORD, T043)E.OF XT.56",State Police,M,...,Speed Related,Speeding,False,,,False,Ticket,False,830814942,1-15 min


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 318669 entries, 0 to 318668
Data columns (total 24 columns):
id                       318669 non-null object
state                    318669 non-null object
stop_date                318669 non-null object
stop_time                318447 non-null object
location_raw             318628 non-null object
county_name              318627 non-null object
county_fips              318627 non-null float64
fine_grained_location    317006 non-null object
police_department        318669 non-null object
driver_gender            318669 non-null object
driver_age_raw           318669 non-null int64
driver_age               318395 non-null float64
driver_race_raw          318669 non-null object
driver_race              318669 non-null object
violation_raw            318669 non-null object
violation                318669 non-null object
search_conducted         318669 non-null bool
search_type_raw          4846 non-null object
search_type              484

Who speeds more often, women or men?

Which violations do men commit, and which is the most common?

Which violations do women commit, and which is the most common?

Which gender is stopped and searched the most?
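
One way to start on the gender questions is sketched below on invented rows. Only the column names (`driver_gender`, `violation`, `search_conducted`) come from the dataset; the values and variable names are made up for illustration.

```python
import pandas as pd

# Made-up stops; only the column names come from the dataset.
stops = pd.DataFrame({
    'driver_gender': ['M', 'M', 'M', 'F', 'F'],
    'violation': ['Speeding', 'Speeding', 'Lights', 'Speeding', 'Seat belt'],
    'search_conducted': [True, False, False, False, False],
})

# Violation counts per gender, most frequent first within each group.
counts = stops.groupby('driver_gender')['violation'].value_counts()

# Most common violation for men.
is_male = stops['driver_gender'] == 'M'
top_male = stops.loc[is_male, 'violation'].value_counts().idxmax()

# Searches per gender: summing booleans counts the True values.
searches = stops.groupby('driver_gender')['search_conducted'].sum()
```

Using `.mean()` instead of `.sum()` on `search_conducted` would give a search rate, which is a fairer comparison when one gender is stopped far more often than the other.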

Why does `search_type` have the most missing data?

In [10]:
df.search_conducted.value_counts()

False    313337
True       5332
Name: search_conducted, dtype: int64

`search_type` is always missing when `search_conducted` is False

In [12]:
df[~df.search_conducted].search_type.value_counts()

Series([], Name: search_type, dtype: int64)

Which year has the fewest stops?

How many people were arrested during each part of the day?
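
A possible starting point for these last two questions is sketched below on invented rows. It assumes that "part of the day" can be approximated by the hour of `stop_time` and that arrests are flagged by `is_arrested`; both are interpretations, not something the dataset documentation is quoted on here.

```python
import pandas as pd

# Made-up stops; only the column names come from the dataset.
stops = pd.DataFrame({
    'stop_date': ['2013-10-01', '2013-10-02', '2014-01-05'],
    'stop_time': ['00:01', '14:30', '14:45'],
    'is_arrested': [False, True, True],
})

# Parse the date strings, then count stops per calendar year.
years = pd.to_datetime(stops['stop_date']).dt.year
stops_per_year = years.value_counts()
quietest_year = stops_per_year.idxmin()

# Parse 'HH:MM' strings and keep the hour; a missing time becomes NaN.
stops['hour'] = pd.to_datetime(stops['stop_time'], format='%H:%M').dt.hour

# Count arrests within each hour of the day.
arrests_per_hour = stops[stops['is_arrested']].groupby('hour').size()
```

To get coarser periods such as morning or evening, the `hour` column could be binned with `pd.cut`, with bin edges chosen to suit the analysis.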