# Lesson 5

## Data Terminology

### Variables

- Dependent variable (DV): the target variable (the variable that we are interested in predicting)
- Independent variable (IV): a predictor variable (a variable whose values are used to predict the values of the target variable)
- Another way to think about DVs and IVs is DV = Y, IV = X
  - Y = aX + b
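
As a minimal sketch of the relationship Y = aX + b, the snippet below computes DV values from IV values along a straight line. The slope `a`, the intercept `b`, and the `x` values are made-up numbers chosen only for illustration.

```python
# Hypothetical slope and intercept (illustrative values, not from any dataset)
a = 2.0
b = 1.0

# Independent variable (IV) values
x = [0.0, 1.0, 2.0, 3.0]

# Dependent variable (DV) values given by Y = aX + b
y = [a * xi + b for xi in x]

print(y)  # [1.0, 3.0, 5.0, 7.0]
```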

### Observations

- Case:
  - a collection of DV and IV values representing a single observational unit
  - represented by one row in a CSV file or other data file
  - e.g. single person, single country, etc.
- Observation:
  - a case that has been measured experimentally
 

### Load, read, parse

- "load" means transferring an entire data file into temporary memory (i.e., RAM)
- "read" means transferring a portion of a data file into temporary memory
- "parse" means interpreting, extracting, reducing, or organizing data that have been loaded or read
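
The three terms can be illustrated with a rough sketch that uses only the Python standard library. The CSV text and its column names (`case`, `dv0`, `iv0`) are hypothetical, and an in-memory `io.StringIO` buffer stands in for a real file so that the example needs no file on disk.

```python
import csv
import io

# Stand-in for a file on disk (hypothetical contents)
csv_text = "case,dv0,iv0\n0,5,0.70\n1,5,0.88\n2,6,0.28\n"

# "load": transfer the entire file into memory
with io.StringIO(csv_text) as f:
    loaded = f.read()

# "read": transfer only a portion (here, the first line) into memory
with io.StringIO(csv_text) as f:
    first_line = f.readline()

# "parse": interpret the loaded text as rows of named fields
rows = list(csv.DictReader(io.StringIO(loaded)))
dv0 = [int(row["dv0"]) for row in rows]

print(first_line.strip())  # case,dv0,iv0
print(dv0)                 # [5, 5, 6]
```

With a real file, `open('some_file.csv')` would take the place of `io.StringIO(csv_text)`.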

# Final Project Dataset Requirements and Recommendations

### Data type (recommendation)

- Use only numerical values
  - other values (dates, strings, etc.) can also be used but are usually more difficult to parse
- Use one or more CSV files
  - other data file formats are fine but may be more difficult to work with

### CSV structure (original or parsed data) (recommendation)

- Columns: Case, DV0, DV1, DV2, DV3, ..., IV0, IV1, ...
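
A small sketch of this column layout, written with the standard `csv` module. The column names follow the recommendation above, while the number of DVs and IVs and all values are made up for illustration.

```python
import csv
import io

# Column names following the recommended layout: case ID, then DVs, then IVs
header = ["Case", "DV0", "IV0", "IV1", "IV2"]

# Made-up values for illustration only
cases = [
    [0, 5, 0.70, 1.9, 9.4],
    [1, 6, 0.28, 1.9, 9.8],
]

# Write to an in-memory buffer; pass an open file object instead to write to disk
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(cases)

print(buffer.getvalue())
```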
  

### Number of cases (requirement)

- 100 or more cases
- 200-1000 is best, but more is OK too

### Number of variables (requirement)

- at least 1 DV
  - more than 1 DV is OK but will complicate analyses
  - recommend against using more than 3 DVs
- at least 3 IVs
  - more than 3 is OK
  - fewer than 10 is best; otherwise some analyses become impossible
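
The requirements on cases and variables can be checked with a few comparisons. The function below is a hypothetical helper, not part of any library, and the counts passed to it are invented for illustration.

```python
def check_dataset(n_cases, n_dvs, n_ivs):
    """Return a list of unmet requirements; an empty list means all are met."""
    problems = []
    if n_cases < 100:
        problems.append("fewer than 100 cases")
    if n_dvs < 1:
        problems.append("no DV")
    if n_ivs < 3:
        problems.append("fewer than 3 IVs")
    return problems

print(check_dataset(n_cases=250, n_dvs=1, n_ivs=4))  # []
print(check_dataset(n_cases=50, n_dvs=1, n_ivs=2))   # two problems reported
```

Only the hard requirements are checked; the recommendations (200-1000 cases, at most 3 DVs, fewer than 10 IVs) are left out on purpose.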

# Python Lesson

## Dictionary 

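A dictionary is a collection of values that are indexed by strings rather than by integer positions

The string indices are called "keys" (keys can also be other immutable types such as integers, but strings are the most common)

A key is used to look up (index) a value in the dictionary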


In [1]:
d = dict(aaa=123, bb=5, cc=100)

print(d)
print(d['aaa'])

{'aaa': 123, 'bb': 5, 'cc': 100}
123
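
A few other common dictionary operations, shown as a short sketch that reuses the dictionary above. The new key `'dd'` and the missing key `'zzz'` are arbitrary names chosen for illustration.

```python
d = dict(aaa=123, bb=5, cc=100)

# Add a new key-value pair, and overwrite an existing value
d['dd'] = 7
d['bb'] = 50

# Test whether a key exists before indexing (indexing a missing key raises KeyError)
print('aaa' in d)   # True
print('zzz' in d)   # False

# Iterate over key-value pairs
for key, value in d.items():
    print(key, value)
```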


# Reading and parsing CSV Data 

In [4]:
import pandas as pd

df = pd.read_csv('winequality-red.csv')

# df: pandas DataFrame object

print(df)

      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0               7.4             0.700         0.00             1.9      0.076   
1               7.8             0.880         0.00             2.6      0.098   
2               7.8             0.760         0.04             2.3      0.092   
3              11.2             0.280         0.56             1.9      0.075   
4               7.4             0.700         0.00             1.9      0.076   
...             ...               ...          ...             ...        ...   
1594            6.2             0.600         0.08             2.0      0.090   
1595            5.9             0.550         0.10             2.2      0.062   
1596            6.3             0.510         0.13             2.3      0.076   
1597            5.9             0.645         0.12             2.0      0.075   
1598            6.0             0.310         0.47             3.6      0.067   

      free sulfur dioxide  

In [5]:
df.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |


In [6]:
print(df['quality'])

0       5
1       5
2       5
3       6
4       5
       ..
1594    5
1595    6
1596    6
1597    5
1598    6
Name: quality, Length: 1599, dtype: int64


In [7]:
# first column: the row index (row labels), not part of the data
# second column: the actual values
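
The index and the values of a single column can also be accessed separately. The sketch below uses a small made-up table in place of the wine data so that it runs without the CSV file; `index`, `to_numpy()`, and `loc` are standard pandas `Series` members.

```python
import pandas as pd

# Small made-up table standing in for the wine data
df = pd.DataFrame({'quality': [5, 5, 6], 'volatile acidity': [0.70, 0.88, 0.28]})

s = df['quality']        # a single column is a pandas Series

print(list(s.index))     # the row labels: [0, 1, 2]
print(s.to_numpy())      # the values only, as a NumPy array
print(s.loc[2])          # the value at row label 2
```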

In [8]:
df1 = df[['quality', 'volatile acidity']]

print(df1)

      quality  volatile acidity
0           5             0.700
1           5             0.880
2           5             0.760
3           6             0.280
4           5             0.700
...       ...               ...
1594        5             0.600
1595        6             0.550
1596        6             0.510
1597        5             0.645
1598        6             0.310

[1599 rows x 2 columns]


In [10]:
df1 = df1.rename(columns={'volatile acidity': 'acid'})

print(df1)

      quality   acid
0           5  0.700
1           5  0.880
2           5  0.760
3           6  0.280
4           5  0.700
...       ...    ...
1594        5  0.600
1595        6  0.550
1596        6  0.510
1597        5  0.645
1598        6  0.310

[1599 rows x 2 columns]
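
Note that `rename` returns a new DataFrame by default and leaves the original unchanged, which is why the result is assigned back to `df1` above. A minimal sketch with a made-up two-row table (the variable name `renamed` is arbitrary):

```python
import pandas as pd

# Small made-up table standing in for the wine data
df1 = pd.DataFrame({'quality': [5, 6], 'volatile acidity': [0.70, 0.28]})

# rename returns a new DataFrame; df1 itself is not modified
renamed = df1.rename(columns={'volatile acidity': 'acid'})

print(list(df1.columns))      # ['quality', 'volatile acidity']
print(list(renamed.columns))  # ['quality', 'acid']
```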


In [11]:
print(df1['acid'])

0       0.700
1       0.880
2       0.760
3       0.280
4       0.700
        ...  
1594    0.600
1595    0.550
1596    0.510
1597    0.645
1598    0.310
Name: acid, Length: 1599, dtype: float64
