1. **Try loading datasets, e.g., Abalone from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.php) and inspect their properties. What fraction of them has missing values? What fraction of the variables is numerical, categorical, or text?**

In [2]:
!pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.6-py3-none-any.whl.metadata (5.3 kB)
Downloading ucimlrepo-0.0.6-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.6


In [8]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
abalone = fetch_ucirepo(id=1) 
  
# data (as pandas dataframes) 
X = abalone.data.features 
y = abalone.data.targets 

In [18]:
X.isnull().any() # check if the columns have missing values

Sex               False
Length            False
Diameter          False
Height            False
Whole_weight      False
Shucked_weight    False
Viscera_weight    False
Shell_weight      False
dtype: bool

No column has missing values, so the fraction of columns with missing values in this dataset is 0.

In [19]:
X.dtypes # get the type of each column

Sex                object
Length            float64
Diameter          float64
Height            float64
Whole_weight      float64
Shucked_weight    float64
Viscera_weight    float64
Shell_weight      float64
dtype: object

In [20]:
X.iloc[1:3]

  Sex  Length  Diameter  Height  Whole_weight  Shucked_weight  Viscera_weight  Shell_weight
1   M    0.35     0.265   0.090        0.2255          0.0995          0.0485          0.07
2   F    0.53     0.420   0.135        0.6770          0.2565          0.1415          0.21


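`Sex` is a categorical variable stored as `object`; in the Abalone dataset it takes the values M, F, and I (infant). The other seven features are numerical (`float64`). So 7 of the 8 features are numerical, 1 is categorical, and none is text.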
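The same inspection can be turned into numbers rather than read off by eye. The sketch below is only an illustration: `X_demo` is a small hypothetical stand-in with a few Abalone-style columns (so that it runs without a download), and `summarize_columns` is a helper name introduced here, not part of pandas or `ucimlrepo`.

```python
import pandas as pd

# Hypothetical stand-in for the real feature frame; replace X_demo with X.
X_demo = pd.DataFrame({
    "Sex": ["M", "F", "I", "M"],
    "Length": [0.455, 0.35, 0.53, 0.44],
    "Diameter": [0.365, 0.265, 0.42, 0.365],
    "Height": [0.095, 0.09, 0.135, 0.125],
})

def summarize_columns(df):
    """Fraction of columns with any missing value, and fraction that are numerical."""
    n_cols = df.shape[1]
    # isnull().any() is True for each column that contains at least one NaN
    frac_missing = df.isnull().any().sum() / n_cols
    # select_dtypes(include="number") keeps integer and float columns only
    frac_numeric = df.select_dtypes(include="number").shape[1] / n_cols
    return {
        "missing": float(frac_missing),
        "numerical": float(frac_numeric),
        "non_numerical": float(1 - frac_numeric),
    }

summary = summarize_columns(X_demo)
print(summary)
```

Note that dtype alone cannot separate categorical from free-text columns: both usually show up as `object`, so that distinction still needs a look at the values (for example with `df[col].nunique()`).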

2. **Try indexing and selecting data columns by name rather than by column number. The pandas documentation on [indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) has further details on how to do this.**

In [17]:
X['Length'].max()

0.815
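Beyond picking a single column, pandas supports several other name-based selections. The following is a minimal sketch on a small hypothetical frame `df` that borrows a few Abalone column names; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature frame with Abalone-style column names.
df = pd.DataFrame({
    "Sex": ["M", "F", "I"],
    "Length": [0.35, 0.53, 0.44],
    "Diameter": [0.265, 0.42, 0.365],
    "Height": [0.09, 0.135, 0.125],
})

length = df["Length"]                       # one name -> Series
two_cols = df[["Length", "Height"]]         # list of names -> DataFrame
label_slice = df.loc[:, "Length":"Height"]  # label slice, both ends inclusive
# Rows chosen by a boolean condition, columns chosen by name
females = df.loc[df["Sex"] == "F", ["Length", "Diameter"]]

print(label_slice.columns.tolist())
print(females)
```

Unlike positional slicing with `iloc`, a label slice with `loc` includes its end label, which is why `Height` appears in `label_slice`.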

3. **How large a dataset do you think you could load this way? What might be the limitations? Hint: consider the time to read the data, representation, processing, and memory footprint. Try this out on your laptop. What happens if you try it out on a server?** 

<font color=red>(uncertain)</font>

The size of the dataset we can load this way is essentially limited by the amount of RAM. My computer has 32 GB of RAM, but that does not mean I can read a 32 GB dataset: how the data is represented in memory and how it is processed also consume RAM. For example, when I loaded 1 GB of image data in Python, it took up a corresponding share of RAM.
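To make the point about representation and memory footprint concrete, here is a rough sketch. It assumes synthetic data (the column names and sizes are made up), measures the footprint with `DataFrame.memory_usage(deep=True)`, and then reads the same data in chunks with the `chunksize` argument of `pd.read_csv`, so that only one chunk is in memory at a time.

```python
import io
import numpy as np
import pandas as pd

# Synthetic data standing in for a larger file (sizes are illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Sex": rng.choice(["M", "F", "I"], size=10_000),
    "Length": rng.random(10_000),
})

# Footprint with default dtypes (strings and float64)
bytes_default = df.memory_usage(deep=True).sum()

# Same data with a more compact representation
compact = df.astype({"Sex": "category", "Length": "float32"})
bytes_compact = compact.memory_usage(deep=True).sum()
print(bytes_default, bytes_compact)

# Chunked reading: compute a mean without holding the whole table at once.
csv_buffer = io.StringIO(df.to_csv(index=False))  # stands in for a file on disk
total, rows = 0.0, 0
for chunk in pd.read_csv(csv_buffer, chunksize=1_000):
    total += chunk["Length"].sum()
    rows += len(chunk)
mean_length = total / rows
print(rows, mean_length)
```

Chunking trades memory for time: the data is still parsed in full, so read time remains a limit even when RAM is not.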

Server memory typically uses more robust technology. Because a server usually has to run continuously for very long periods, its memory needs to be stable and capable of error correction. What differents of 

4. **How would you deal with data that has a very large number of categories? What if the category labels are all unique? Should you include the latter?**

<font color=green>(from the discussion)</font>

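A categorical feature with too many distinct values carries little information per category, because each value is backed by only a few examples. If every value is unique (for example, an ID), the feature cannot generalize at all, so we can safely exclude it. For features with many but not all-unique categories, manually or automatically merging categories (for example, grouping rare ones together) might be required.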
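A minimal sketch of both ideas follows, assuming a made-up frame with an all-unique `id` column and a `city` column. The threshold `min_count` and the label `"Other"` are arbitrary choices for illustration, not a recommended setting.

```python
import pandas as pd

# Hypothetical data: "id" is unique per row, "city" has a rare category.
df = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4", "e5", "f6"],
    "city": ["NY", "NY", "NY", "LA", "SF", "LA"],
})

# 1) Drop non-numeric columns whose labels are all unique.
non_numeric = df.select_dtypes(exclude="number").columns
unique_cols = [c for c in non_numeric if df[c].nunique() == len(df)]
df = df.drop(columns=unique_cols)

# 2) Merge categories seen fewer than min_count times into "Other".
min_count = 2
counts = df["city"].value_counts()
rare = counts[counts < min_count].index
df["city"] = df["city"].where(~df["city"].isin(rare), "Other")

# 3) One-hot encode the reduced set of categories.
encoded = pd.get_dummies(df, columns=["city"])
print(unique_cols)
print(encoded.columns.tolist())
```

The uniqueness check is restricted to non-numeric columns on purpose: a continuous numerical feature can have all-distinct values and still be informative.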

5. **What alternatives to pandas can you think of? How about [loading NumPy tensors from a file](https://numpy.org/doc/stable/reference/generated/numpy.load.html)? Check out [Pillow](https://python-pillow.org/), the Python Imaging Library.**

In [4]:
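# Save a NumPy array to a file and load it back
import os
import numpy as np
os.makedirs('material', exist_ok=True)  # np.save fails if the directory is missing
a = np.zeros((3,3))
np.save('material/a.npy', a)
b = np.load('material/a.npy')
b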

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [5]:
# Load image by Pillow
from PIL import Image
img = Image.open("material/cat.png")
img.show()
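Once an image is opened with Pillow, it can be handed to NumPy for numerical work. The sketch below builds a small synthetic image in memory instead of reading `material/cat.png`, so the pixel values and sizes are assumptions made for illustration only.

```python
import numpy as np
from PIL import Image

# Synthetic 4x6 RGB image (height x width x channels), pure red.
pixels = np.zeros((4, 6, 3), dtype=np.uint8)
pixels[..., 0] = 255

img = Image.fromarray(pixels)  # NumPy array -> Pillow image
arr = np.asarray(img)          # Pillow image -> NumPy array

print(img.mode, img.size)      # Pillow reports size as (width, height)
print(arr.shape, arr.dtype)    # NumPy reports shape as (height, width, channels)
```

The differing axis order is worth keeping in mind: `img.size` is (width, height), while `arr.shape` starts with the height.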