# ICS 434 DATA SCIENCE FUNDAMENTALS

## BRIEF INTRODUCTION


### What is Data Science?


* A field with no consensus on its exact definition

```
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data.
```
https://en.wikipedia.org/wiki/Data_science


### What is Data Science? - Cont'd

* Aim: to transform raw data into actionable knowledge.

* Data Science process can be described by the DIKW Pyramid

  * A step of the pyramid adds value or answers a question by using its lower layers.
    
<img src="https://www.dropbox.com/s/7mh80nc9jctirng/data_to_wisdom.png?dl=1" alt="drawing" style="width:500px;"/>


### DIKW Pyramid: Data

* Data is the fundamental block of the DIKW pyramid and data science.
   * No data science without data.
   
* Data can arise from various processes. Ex.:
    * Manually collected or generated (ex. through phone surveys, user photos).
    * Automatically generated (Ex. Traffic sensors, IoT).
    * Simulated (ex. weather or astronomy models).
    
* Raw Data rarely has any intrinsic value for the entity that collects it.
  * It must be processed, interrogated, and perhaps compared or enriched with other datasets to generate extrinsic value.

### DIKW Pyramid: Information

* The representation of the processed data
    * Data is cleaned (ex. removing or imputing missing values), disambiguated (ex. fixing alternative spellings), aggregated or enriched (ex. extending a zip code with the median income in that neighborhood), etc.
    * Exploratory data analysis (ex. visualizing the distribution of a variable, or computing summary statistics) can generate valuable information.
        * Ex.:
            * The distribution of spending on any day of the week has a long tail.
            * The mean time usage of my app is ~1.5 hours per day.
        
* While *informative*, information gathered may not represent valuable knowledge.
    * Not directly *actionable* and does not directly lead to specific conclusions.
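
As a rough illustration of the cleaning and summary-statistics steps described above, the sketch below imputes missing values with the mean and then computes a few summary statistics using only the standard library. The usage numbers and the `impute_with_mean` helper are made up for illustration; they are not course data.

```python
from statistics import mean, median

def impute_with_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill_value = mean(observed)
    return [fill_value if v is None else v for v in values]

# Hypothetical daily app usage in hours; None marks a missing measurement.
daily_usage_hours = [1.0, None, 2.0, 1.5, None, 1.5]

cleaned = impute_with_mean(daily_usage_hours)
summary = {
    "mean": mean(cleaned),
    "median": median(cleaned),
    "min": min(cleaned),
    "max": max(cleaned),
}
print(cleaned)
print(summary)
```

Mean imputation is only one of several possible choices; dropping the missing rows or imputing with the median are equally simple alternatives.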
   

### DIKW Pyramid: Knowledge

* Extracting non-obvious insight

* I.e., how is the information gathered from the data relevant to our end-goals?

    * Are there differences between males' and females' usage of my app?
    * What features of my app are strongly correlated with long usage?

* For a business or a researcher, this knowledge is the edge you are trying to gain.
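
One way to approach a question such as "which features are correlated with long usage?" is to compute a correlation coefficient. The sketch below implements the Pearson correlation by hand on made-up numbers; `feature_uses` and `usage_hours` are hypothetical variables, not real app data.

```python
from math import sqrt

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical per-user data: times a feature was used, and hours spent in the app.
feature_uses = [0, 1, 2, 3, 4]
usage_hours = [0.5, 1.0, 1.5, 2.0, 2.5]

r = pearson_correlation(feature_uses, usage_hours)
print(round(r, 3))
```

A value of `r` close to 1 indicates a strong positive linear relationship. A strong correlation alone does not show that the feature causes the longer usage.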


### DIKW Pyramid: Wisdom

* Wisdom is knowledge applied in action.

* In contrast to deductive science, the knowledge gained may not be certain, and the actions taken may not be optimal.
  * Conclusions are probable and reasonable and actions are reasonably favorable.

* Feature X leads to longer use time. Adding more features like X will further increase use time.
* Feature Y is the least used one in the app. Adding features like Y is not a good use of time and effort.


###  Data Science: Business Definition 

```
The ability to take data — to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it — that’s going to be a hugely important skill in the next decades.

  - Hal Varian, Chief Economist at Google and Founding Dean of the School of Information at UC Berkeley.
```



### What Is Data Science?



<img src="https://www.dropbox.com/s/7rn0bjpp315p1km/datascience_venn_berkeley.png?dl=1" alt="drawing" style="width:600px;"/>

### Data Science In Academia

* Increasing number of academic sub-disciplines that focus on data analysis.

* Sub-disciplines with adjectives such as “informatics”, “computational” or “quantitative” in front of them.

  * Ecoinformatics, chemoinformatics, social informatics, computational biology, quantitative finance, computational linguistics, ...

* I consider these to be manifestations of data science in those areas



### Is Data Science New?

* Researchers in academia and industry have been working through the DIKW process for many years.
  * Many fields combine CS, math/stats, engineering, and domain expertise.
  * Walk on the 4th, 5th, 6th, and 7th Floors of POST and read some of the posters on the wall.
    * Ask astronomers, biologists working on genomics, or researchers in econometrics, to name only a few.
* Reliance on complex data was not, as the media portrays it, created yesterday.
* Many data scientists in the industry are biologists, experimental physicists, astrophysicists, oceanographers, etc.

### Why the sudden excitement?

* New technology makes it possible to capture vast amounts of data (breadth and width).
  * Apple Watch makes it easier to collect heart rate data from millions of people (breadth)
  * IoT advances make it easier to measure a broad range of new signals (width) 
  
* A trend of "__datafication__" or taking all aspects of life and turning them into data.
  * LinkedIn "datafies", among other things, professional relevance and HR churn.
  * Mastercard "datafies" you as a customer, patient, citizen, etc...


* Existing tools and methods can now scale to these new types of complex data.


### A Data Science Renaissance

* Computing advances make it possible to analyze data on ever-increasing scales and complexities.

* Realization that the world will be increasingly “datafied” fuels new interest in research on the analysis of complex data.
  * Complex data can be large or tiny data.
  * Immense complexity if combining heterogeneous data. For example, how do you combine and extract knowledge from:
    * Molecular (cell-level) data,
    * Hospital medical records,
    * Apple Watch data (heart rate, physical activities, eating habits from GPS, etc.)

### Isn’t Statistics the “Science of Data”?


* A data scientist is an expert in computer science (or a skilled programmer), understands the intricacies of statistics and probabilities, has excellent communication skills, knows how to visualize data, and has domain expertise (understanding). 

* Current statistical approaches may not always be possible. 
  * Data too large to apply statistical inference.
  * Underlying assumptions do not always apply to the data.

* Emphasis is on gaining valuable insight regardless of methods used.
  * All (many?) roads lead to Rome.
  * Much of the initial development happened in industry, not academia. So the emphasis was on creative approaches, not always mathematically or statistically rigorous ones.
  
* A foundation in statistics is paramount in data science.
  * A PhD in statistics is not required.


### Python as a Principal Tool for Data Science

* Python is a general-purpose programming language.

* Supports rapid development of scripts and full-fledged applications.
  * You can write a quick analysis, a web portal (Django or Flask), or a business application quickly and (fairly) efficiently.
  
* A popular language with an abundance of resources.
  * Currently first in popularity according to Stack Overflow and Google Trends.
  
* Very simple syntax and gentle learning curve.

### Python's Popularity
<img src="https://www.dropbox.com/s/ocxi2wlrxysfd07/tiobe_2022.png?dl=1" alt="drawing" style="width:600px;"/>


### Python's Main Advantages

* Gentle learning curve
* Supported on all platforms (including the $20 Raspberry Pi)
* Used in various fields
  * Finance, biology, astronomy, oceanography, transportation, etc
* Vast and thriving user community with a vibrant ecosystem of third-party packages
    * Plethora of data science libraries in various domains.



### Sample Python Projects

<img src="https://www.dropbox.com/s/counsx56hbovgrr/python_products.png?dl=1" alt="drawing" style="width:600px;"/>

See the following for more details: https://realpython.com/world-class-companies-using-python/


### Python Covered in This Course

* We will cover Python as a tool for Data Science.
  * Not interested in OO programming or advanced features.
    * You should explore software engineering and advanced features in Python on your own.    

* We will cover in depth the functionality needed to deal with data at various stages.
  * Manipulating data (ex. read, clean and normalize data)
  * Analyzing data (ex. model data, reduce dimensionality and test hypotheses)
  * Visualizing data (ex. plot distributions and inspect data)


### Python: Origin and Future

* Created by Guido van Rossum (GvR) and first released in 1991
  * Served as Benevolent Dictator For Life (BDFL), with the final say in disputes or arguments within the community, until stepping down in 2018
    * GvR retired after working for Dropbox and Google, among other large companies
      * Came out of retirement to join Microsoft! 🤔

* Funded and managed by the Python Software Foundation (PSF) 
  * A 501(c)(3) non-profit corporation that holds the intellectual property rights

  * Manages the open source licensing and trademarks associated with Python
 
```
The mission of the Python Software Foundation is to promote, protect, and advance the Python programming language, 
and to support and facilitate the growth of a diverse and international community of Python programmers.
 - from the Mission Statement page
```

* The PEP 8 document provides guidelines and best practices on how to write Python code.

### Example of Python Code - 1

* List of famous people

* What are the most common first names among famous people?

| First Name | Last Name |
| --- | ----------- |
| Steven | Gerrard | 
| Bruce | Lee |
| Barack | Obama |
| Oprah | Winfrey |
| Roger | Federer |
| James | Taylor |
| Brad | Pitt |
| Steven | Spielberg |
| Andy | Warhol |
| Andy | Roddick |



### Example of Python Code - 2

```
# First_Name\tLast_Name
# 
Steven\tGerrard
Bruce\tLee
...
```


```python
# Count and display the number of occurrences of first names from a TSV file
famous_people_collection = {}

with open("famous_people.tsv") as tsv_file:
    for line in tsv_file:
        line = line.strip()
        # Skip blank lines and comment lines starting with "#"
        if line and not line.startswith("#"):
            data = line.split("\t")
            first_name = data[0]
            if first_name not in famous_people_collection:
                famous_people_collection[first_name] = 1
            else:
                famous_people_collection[first_name] += 1

for name, count in famous_people_collection.items():
    print("first name is: {}, count is: {}".format(name, count))
```

* This code is for illustration purposes only.


### Example of Python Code - 3

```python
from collections import Counter
with open("famous_people.tsv") as f:  # skip blank and "#" comment lines
    famous_people_collection = Counter(x.split("\t")[0] for x in f if x.strip() and not x.startswith("#"))
```
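
To answer the original question (the most common first names) without needing the file on disk, here is a small sketch that applies `Counter.most_common` to the first names copied from the table of famous people. The `first_names` list and the choice of the top two are just for illustration.

```python
from collections import Counter

# First names copied from the table of famous people above.
first_names = ["Steven", "Bruce", "Barack", "Oprah", "Roger",
               "James", "Brad", "Steven", "Andy", "Andy"]

name_counts = Counter(first_names)

# most_common(n) returns the n most frequent items as (item, count) pairs.
top_two = name_counts.most_common(2)
print(top_two)  # [('Steven', 2), ('Andy', 2)]
```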

### Disadvantages of Python Compared with Other Programming Languages

* Speed and errors
  * No compiler to optimize code or identify errors before runtime
      * Code linters, i.e., programs that analyze code, can detect basic errors
 
* No fine-grained control to tweak data structures easily 
  * Think for instance of pointers and memory management in C and C++
 
* A lower entry barrier that has enabled a sea of “Hero Code”!
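
The point about errors only appearing at runtime can be seen in a minimal sketch. The `report_total` function below is hypothetical and contains a deliberate typo; Python accepts the definition without complaint, and the mistake only surfaces when the faulty line actually executes.

```python
def report_total(prices):
    total = sum(prices)
    return totl  # deliberate typo: "totl" instead of "total"

# Defining the function above raised no error; calling it does.
try:
    report_total([1.0, 2.5])
    error_name = None
except NameError as exc:
    error_name = type(exc).__name__

print(error_name)  # NameError
```

A linter would typically flag the undefined name `totl` before the code is ever run.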

### Python Versions

* We will be using Python 3 in this course
  * I am currently running Python 3.9.7
  * Make sure you are using Python > 3.7
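
A quick way to check which version you are running is sketched below. It assumes that "> 3.7" means Python 3.8 or newer; `MIN_VERSION` and `is_supported` are illustrative names, so adjust the threshold if 3.7 itself is acceptable.

```python
import sys

# Assumption: the slide's "> 3.7" is read as "3.8 or newer".
MIN_VERSION = (3, 8)

def is_supported(version_info=sys.version_info):
    """Return True if the given version tuple meets the assumed minimum."""
    return tuple(version_info[:2]) >= MIN_VERSION

print("Running Python {}.{}.{}".format(*sys.version_info[:3]))
print("Supported:", is_supported())
```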

### What we Will Cover

* The tools and materials we will cover will align well with the DIKW pyramid
* **DATA**: Data wrangling and processing
* **INFORMATION**: Exploratory data analysis (data vis., summary statistics)
* **KNOWLEDGE**: Modeling and machine learning
* **WISDOM**: Model validation, hypothesis testing

* We will move back and forth and rotate between these topics

### What we Will Cover - Cont'd

* Data Wrangling in *Pandas*
  * Both tabular and time series data
  * From basic indexing to split-apply-combine paradigm  
  * Covered predominantly throughout the first section of the course
* Data Visualization in Matplotlib and Seaborn
  * Covered throughout the course
* Probability distributions, summary statistics, parameter estimation and simulation
  * NumPy, SciPy, statsmodels
  * Covered during the second and third quarters of the course
* Introduction to Machine Learning Models using statsmodels and scikit-learn.
  * Emphasis on statistical learning    
  * Covered in the last quarter of the course
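
For readers unfamiliar with the split-apply-combine paradigm mentioned above, here is a minimal plain-Python sketch of the idea. In Pandas this is typically done with `DataFrame.groupby`; the `records` data below is made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (day of week, amount spent).
records = [("Mon", 10.0), ("Tue", 20.0), ("Mon", 30.0), ("Tue", 40.0), ("Wed", 5.0)]

# Split: group the amounts by day.
groups = defaultdict(list)
for day, amount in records:
    groups[day].append(amount)

# Apply: compute one summary value (the mean) per group.
# Combine: collect the per-group results into a single mapping.
mean_spending = {day: mean(amounts) for day, amounts in groups.items()}
print(mean_spending)
```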

### FURTHER READING:

What is Data Science (Edureka)<br/>
https://www.edureka.co/blog/what-is-data-science/

The Incredible Growth of Python (David Robinson -- Stack Overflow)<br/>
https://stackoverflow.blog/2017/09/06/incredible-growth-python/

Python is becoming the world’s most popular coding language (The Economist)<br/>
https://www.economist.com/graphic-detail/2018/07/26/python-is-becoming-the-worlds-most-popular-coding-language


### Interesting Python Resources

* How to Write Beautiful Python Code With PEP 8: https://realpython.com/python-pep8/
    
* Site with *Excellent* Python Tutorials and Demos : https://realpython.com
  * Many free tutorials + membership-based access

