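# Given a dataset (a table: Excel, TSV, CSV, data)
## 1. Macro-level overview
1.1. How many rows and columns (fields) the table has.

1.2. What the fields are, the data type of each field, and how many values are missing.

1.3. For numeric columns, their summary statistics such as the mean and the maximum.

1.4. What the table's index looks like.
## 2. Micro-level view
2.1. Look at a few sample rows, and use the field names or the dataset documentation to understand what each column means.

2.2. For a specific column of enumerated (categorical) type, how many distinct values it has.

2.3. For a specific column, count how often each value occurs and identify the most frequent one.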
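The checklist above can be folded into one small helper. The following is only a minimal sketch: the function name `overview` and the toy table are assumptions made for illustration and are not part of any dataset used in this notebook.

```python
import pandas as pd


def overview(df: pd.DataFrame) -> dict:
    """Collect the macro-level facts from the checklist into one dict."""
    return {
        "n_rows": df.shape[0],                      # 1.1 number of rows
        "n_cols": df.shape[1],                      # 1.1 number of columns
        "dtypes": df.dtypes.astype(str).to_dict(),  # 1.2 data type per field
        "missing": df.isna().sum().to_dict(),       # 1.2 missing values per field
        "numeric_summary": df.describe(),           # 1.3 statistics of numeric columns
        "index": df.index,                          # 1.4 the index
    }


# Hypothetical toy table, used only so the sketch runs on its own.
toy = pd.DataFrame(
    {
        "order_id": [1, 1, 2],
        "item_name": ["Izze", "Chips", "Izze"],
        "choice_description": ["[Clementine]", None, "[Apple]"],
    }
)

summary = overview(toy)
print(summary["n_rows"], summary["n_cols"])
print(summary["missing"])

# Micro-level checks (2.2 and 2.3) on a single column.
print(toy["item_name"].nunique())
print(toy["item_name"].value_counts().index[0])
```

Returning a dict keeps each fact addressable by name; the individual calls are the same ones applied one by one to the real datasets in the rest of the notebook.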

In [62]:
import pandas as pd
import numpy as np

# First dataset

In [63]:
file_address = "data/chipotle.tsv"
chipo = pd.read_csv(file_address, sep='\t')

### How many rows and columns the table has

In [64]:
print(chipo.shape)
print(chipo.shape[0])
print(chipo.shape[1])

(4622, 5)
4622
5


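### What the column names are (which fields exist)
### Basic information about the table
Which columns exist, the data type of each column, how many values are missing in each column, and so on.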

In [65]:
print(chipo.columns)
print("##########")
print(chipo.dtypes)
print("##########")
print(chipo.info())  # info() prints its own report and returns None, hence the trailing "None" in the output

Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')
##########
order_id               int64
quantity               int64
item_name             object
choice_description    object
item_price            object
dtype: object
##########
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   order_id            4622 non-null   int64 
 1   quantity            4622 non-null   int64 
 2   item_name           4622 non-null   object
 3   choice_description  3376 non-null   object
 4   item_price          4622 non-null   object
dtypes: int64(2), object(3)
memory usage: 180.7+ KB
None
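`info()` reports non-null counts, so the number of missing values has to be read off by subtraction (for `choice_description`, 4622 - 3376 = 1246). A minimal sketch of getting missing counts directly is shown below; the small frame `df` is a hypothetical stand-in, not the chipotle data.

```python
import pandas as pd

# Hypothetical stand-in for a table with gaps in one column.
df = pd.DataFrame(
    {
        "item_name": ["Izze", "Chips", "Steak Burrito", "Chips"],
        "choice_description": ["[Clementine]", None, "[Rice]", None],
    }
)

missing_count = df.isna().sum()   # number of missing values per column
missing_share = df.isna().mean()  # fraction of rows that are missing per column

print(missing_count)
print(missing_share)
```

The same two expressions should apply unchanged to `chipo`.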


### What the table's index is

In [66]:
print(chipo.index)

RangeIndex(start=0, stop=4622, step=1)


### Show the first and last N rows

In [67]:
print(chipo.head(4))
print("##########")
print(chipo.tail())

   order_id  quantity                              item_name  \
0         1         1           Chips and Fresh Tomato Salsa   
1         1         1                                   Izze   
2         1         1                       Nantucket Nectar   
3         1         1  Chips and Tomatillo-Green Chili Salsa   

  choice_description item_price  
0                NaN      $2.39  
1       [Clementine]      $3.39  
2            [Apple]      $3.39  
3                NaN      $2.39  
##########
      order_id  quantity           item_name  \
4617      1833         1       Steak Burrito   
4618      1833         1       Steak Burrito   
4619      1834         1  Chicken Salad Bowl   
4620      1834         1  Chicken Salad Bowl   
4621      1834         1  Chicken Salad Bowl   

                                     choice_description item_price  
4617  [Fresh Tomato Salsa, [Rice, Black Beans, Sour ...     $11.75  
4618  [Fresh Tomato Salsa, [Rice, Sour Cream, Cheese...     $11.75  
46

# Second dataset

In [68]:
file_address = "data/u.user"
users = pd.read_csv(file_address, sep='|', index_col='user_id')

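## Macro-level overview of the dataset

How many rows and columns (fields) the table has.

What the fields are, the data type of each field, and how many values are missing.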

In [69]:
print(users.shape, users.shape[0], users.shape[1])
print("##########")
print(users.columns)
print("##########")
print(users.dtypes)
print("##########")
print(users.info())  # info() prints its own report and returns None, hence the trailing "None" in the output

(943, 4) 943 4
##########
Index(['age', 'gender', 'occupation', 'zip_code'], dtype='object')
##########
age            int64
gender        object
occupation    object
zip_code      object
dtype: object
##########
<class 'pandas.core.frame.DataFrame'>
Int64Index: 943 entries, 1 to 943
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   age         943 non-null    int64 
 1   gender      943 non-null    object
 2   occupation  943 non-null    object
 3   zip_code    943 non-null    object
dtypes: int64(1), object(3)
memory usage: 36.8+ KB
None


For numeric columns, compute summary statistics such as the mean and the maximum.

In [70]:
print(users.describe())
print("##########")
print(users.describe(include = "all"))
print("##########")
print(users.occupation.describe())

              age
count  943.000000
mean    34.051962
std     12.192740
min      7.000000
25%     25.000000
50%     31.000000
75%     43.000000
max     73.000000
##########
               age gender occupation zip_code
count   943.000000    943        943      943
unique         NaN      2         21      795
top            NaN      M    student    55414
freq           NaN    670        196        9
mean     34.051962    NaN        NaN      NaN
std      12.192740    NaN        NaN      NaN
min       7.000000    NaN        NaN      NaN
25%      25.000000    NaN        NaN      NaN
50%      31.000000    NaN        NaN      NaN
75%      43.000000    NaN        NaN      NaN
max      73.000000    NaN        NaN      NaN
##########
count         943
unique         21
top       student
freq          196
Name: occupation, dtype: object


In [71]:
print(users.age.mean())
print("##########")
print(users.age.describe()["mean"])

34.05196182396607
##########
34.05196182396607
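When only a few statistics are needed, `agg` with a list of names is an alternative to picking values out of `describe()`. This is a minimal sketch; the ages below are made up for illustration and are not taken from `u.user`.

```python
import pandas as pd

# Hypothetical ages, used only so the sketch runs on its own.
ages = pd.Series([24, 53, 23, 24, 33], name="age")

# Request exactly the statistics of interest; the result is a Series keyed by name.
stats = ages.agg(["mean", "min", "max", "median"])
print(stats)
```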


What the table's index looks like

In [72]:
print(users.index)

Int64Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
            ...
            934, 935, 936, 937, 938, 939, 940, 941, 942, 943],
           dtype='int64', name='user_id', length=943)


## Micro-level view

Look at a few sample rows, and use the field names or the dataset documentation to understand what each column means.

In [73]:
print(users.head())
print("##########")
print(users.tail(3))

         age gender  occupation zip_code
user_id                                 
1         24      M  technician    85711
2         53      F       other    94043
3         23      M      writer    32067
4         24      M  technician    43537
5         33      F       other    15213
##########
         age gender occupation zip_code
user_id                                
941       20      M    student    97229
942       48      F  librarian    78209
943       22      M    student    77841


For a specific column of enumerated (categorical) type, how many distinct values it has.

In [74]:
print(users.occupation.unique())
print("##########")
print(users.occupation.nunique())
print("##########")
print(users.occupation.describe()["unique"])

['technician' 'other' 'writer' 'executive' 'administrator' 'student'
 'lawyer' 'educator' 'scientist' 'entertainment' 'programmer' 'librarian'
 'homemaker' 'artist' 'engineer' 'marketing' 'none' 'healthcare' 'retired'
 'salesman' 'doctor']
##########
21
##########
21
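The three approaches agree here because `occupation` has no missing values. When a column does contain missing values, `unique()` includes them while `nunique()` skips them by default. A minimal sketch with a made-up Series:

```python
import pandas as pd

# Hypothetical column with one missing entry.
s = pd.Series(["student", "other", None, "student"], name="occupation")

print(s.unique())               # the array of distinct values, including the missing one
print(s.nunique())              # 2: missing values are skipped by default
print(s.nunique(dropna=False))  # 3: the missing value is counted as one more distinct value
```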


For a specific column, count how often each value occurs and identify the most frequent one.

In [75]:
print(users.occupation)  # attribute access to the column
print("##########")
print(users['occupation'])  # equivalent bracket access

user_id
1         technician
2              other
3             writer
4         technician
5              other
           ...      
939          student
940    administrator
941          student
942        librarian
943          student
Name: occupation, Length: 943, dtype: object
##########
user_id
1         technician
2              other
3             writer
4         technician
5              other
           ...      
939          student
940    administrator
941          student
942        librarian
943          student
Name: occupation, Length: 943, dtype: object


In [76]:
print(users.occupation.value_counts())

student          196
other            105
educator          95
administrator     79
engineer          67
programmer        66
librarian         51
writer            45
executive         32
scientist         31
artist            28
technician        27
marketing         26
entertainment     18
healthcare        16
retired           14
lawyer            12
salesman          12
none               9
homemaker          7
doctor             7
Name: occupation, dtype: int64
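Raw counts are easier to compare as shares; for instance, 196 students out of 943 users is about 20.8%. `value_counts(normalize=True)` returns these fractions directly. A minimal sketch on a made-up column:

```python
import pandas as pd

# Hypothetical occupations, used only so the sketch runs on its own.
occupation = pd.Series(
    ["student", "student", "other", "educator"], name="occupation"
)

# normalize=True divides each count by the total number of non-missing values.
shares = occupation.value_counts(normalize=True)
print(shares)
```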


In [77]:
print(users.occupation.value_counts().head(1))
print("##########")
print(users.occupation.value_counts().head(1).index)
print("##########")
print(users.occupation.value_counts().head(1).index[0])
print("##########")
print(users.occupation.value_counts().index[0])

student    196
Name: occupation, dtype: int64
##########
Index(['student'], dtype='object')
##########
student
##########
student
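Two other ways to reach the most frequent value are `idxmax()` on the counts and `mode()` on the column itself. A minimal sketch, again on a made-up column:

```python
import pandas as pd

# Hypothetical occupations, used only so the sketch runs on its own.
occupation = pd.Series(["student", "other", "student", "writer"], name="occupation")

top_by_counts = occupation.value_counts().idxmax()  # label with the highest count
top_by_mode = occupation.mode()[0]                  # mode() returns a Series; take its first entry

print(top_by_counts, top_by_mode)
```

Note that `mode()` returns every tied value when several share the top frequency, which is why it yields a Series rather than a single value.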


# Third dataset

In [78]:
file_address = "data/en.openfoodfacts.org.products.tsv"
food = pd.read_csv(file_address, sep='\t')
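This file is large (its shape and memory usage are reported further down), so it can help to peek at a few rows or a few columns before loading everything. The sketch below uses the `nrows` and `usecols` parameters of `read_csv`; the in-memory text and its values are hypothetical stand-ins for the real file.

```python
import io

import pandas as pd

# Hypothetical in-memory TSV standing in for a large file on disk.
raw = (
    "code\tcreator\tproduct_name\n"
    "3087\tuser_a\tFlour\n"
    "4530\tuser_b\tChips\n"
    "4559\tuser_b\tNuts\n"
)

# Read only the first two data rows.
preview = pd.read_csv(io.StringIO(raw), sep="\t", nrows=2)

# Read all rows but only one column.
subset = pd.read_csv(io.StringIO(raw), sep="\t", usecols=["creator"])

print(preview.shape)
print(subset.columns.tolist())
```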


How many rows and columns (fields) the table has

(356027, 163)

What the fields are, the data type of each field, and how many values are missing.

code                        object
url                         object
creator                     object
created_t                   object
created_datetime            object
                            ...   
carbon-footprint_100g      float64
nutrition-score-fr_100g    float64
nutrition-score-uk_100g    float64
glycemic-index_100g        float64
water-hardness_100g        float64
Length: 163, dtype: object
##
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356027 entries, 0 to 356026
Columns: 163 entries, code to water-hardness_100g
dtypes: float64(107), object(56)
memory usage: 442.8+ MB
None
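With 163 columns the per-column listing is truncated, so a grouped view is more practical. The sketch below counts columns per dtype, mirroring the `dtypes: float64(107), object(56)` line, and lists the numeric columns. The miniature frame `df` is a hypothetical stand-in for the real table.

```python
import pandas as pd

# Hypothetical miniature of a wide table.
df = pd.DataFrame(
    {
        "code": ["3087", "4530"],
        "creator": ["user_a", "user_b"],
        "additives_n": [0.0, 1.0],
    }
)

# Number of columns per dtype.
print(df.dtypes.value_counts())

# Names of the numeric columns only.
numeric_columns = df.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```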


For numeric columns, their summary statistics such as the mean and the maximum

       no_nutriments    additives_n  ingredients_from_palm_oil_n  \
count            0.0  283867.000000                283867.000000   
mean             NaN       1.876851                     0.023430   
std              NaN       2.501022                     0.153094   
min              NaN       0.000000                     0.000000   
25%              NaN       0.000000                     0.000000   
50%              NaN       1.000000                     0.000000   
75%              NaN       3.000000                     0.000000   
max              NaN      30.000000                     2.000000   

       ingredients_from_palm_oil  ingredients_that_may_be_from_palm_oil_n  \
count                        0.0                            283867.000000   
mean                         NaN                                 0.059736   
std                          NaN                                 0.280660   
min                          NaN                                 0.000000   
25

What the table's index looks like

RangeIndex(start=0, stop=356027, step=1)


Look at a few sample rows, and use the field names or the dataset documentation to understand what each column means.

    code                                                url  \
0   3087  http://world-en.openfoodfacts.org/product/0000...   
1   4530  http://world-en.openfoodfacts.org/product/0000...   
2   4559  http://world-en.openfoodfacts.org/product/0000...   
3  16087  http://world-en.openfoodfacts.org/product/0000...   
4  16094  http://world-en.openfoodfacts.org/product/0000...   

                      creator   created_t      created_datetime  \
0  openfoodfacts-contributors  1474103866  2016-09-17T09:17:46Z   
1             usda-ndb-import  1489069957  2017-03-09T14:32:37Z   
2             usda-ndb-import  1489069957  2017-03-09T14:32:37Z   
3             usda-ndb-import  1489055731  2017-03-09T10:35:31Z   
4             usda-ndb-import  1489055653  2017-03-09T10:34:13Z   

  last_modified_t last_modified_datetime                    product_name  \
0      1474103893   2016-09-17T09:18:13Z              Farine de blé noir   
1      1489069957   2017-03-09T14:32:37Z  Banana Chips Sweetened (

For the specific column "creator", how many distinct values it has.

3890
#############
3890
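The cell that produced these two numbers is not shown. Assuming it follows the pattern used for the second dataset, it may have looked like the sketch below; `food_demo` and its values are hypothetical, and only the column name `creator` comes from the text.

```python
import pandas as pd

# Hypothetical miniature of the table; only the column name "creator" comes from the text.
food_demo = pd.DataFrame({"creator": ["user_a", "user_b", "user_b", "user_c"]})

n_unique = food_demo["creator"].nunique()
n_unique_from_describe = food_demo["creator"].describe()["unique"]

print(n_unique)
print("#############")
print(n_unique_from_describe)
```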


Count how often each "creator" value occurs and identify the most frequent one

Find the most frequent value

usda-ndb-import               169868
openfoodfacts-contributors     45805
kiliweb                        36379
date-limite-app                12679
openfood-ch-import             11469
                               ...  
leleio                             1
bora                               1
sevede28                           1
brunoa                             1
climboxing                         1
Name: creator, Length: 3890, dtype: int64
#############
usda-ndb-import
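The code for this step is not shown either. A minimal sketch under the same assumptions (a hypothetical `food_demo` frame with a `creator` column) would count the values, take the first index label as the most frequent one, and, since the output above has a long tail of values with a count of 1, also count how many values occur exactly once.

```python
import pandas as pd

# Hypothetical miniature of the table; only the column name "creator" comes from the text.
food_demo = pd.DataFrame({"creator": ["user_a", "user_b", "user_b", "user_c"]})

counts = food_demo["creator"].value_counts()  # sorted from most to least frequent
most_frequent = counts.index[0]               # first label is the most frequent value
n_singletons = int((counts == 1).sum())       # values that occur exactly once

print(counts)
print("#############")
print(most_frequent)
print(n_singletons)
```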
