<a href="https://colab.research.google.com/github/gitmystuff/DTSC4050/blob/main/Week_01-Introduction/Week_01_Overview_and_Summary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 01 - Overview and Summary

Here's an overview and summary of the key concepts, tools, and techniques discussed, organized by topic:

**Data Science and the Data Science Lifecycle**
*   Data science is a rapidly growing field that uses a systematic approach to extract value from data.
*   The data science lifecycle has five main steps: defining the problem, data collection and preparation, data exploration and analysis, model building and evaluation, and deployment and maintenance.
*   **Defining the problem** involves understanding stakeholder requirements, framing the problem, setting success criteria, prioritizing, and documenting.
*   Data scientists need to be curious and organized, as well as proficient in statistical functions.
*   Data scientists analyze raw data, build data models, and infer results.

**Data and Data Analysis**
*   **Data** are sets of values of qualitative or quantitative variables about one or more persons or objects. Data can be transformed into information when viewed in context.
*   **Data analysis** includes quantitative (statistical) analysis, which looks for patterns in numerical data and often relies on data visualization, and qualitative analysis, which draws general information from non-numeric sources such as text and other media.
*   Data are measured, collected, reported, and analyzed and are used to create data visualizations such as graphs, tables, or images.
*   **Exploratory data analysis (EDA)** is used to summarize the main characteristics of a dataset, often using statistical graphics and other data visualization methods. EDA is used to explore data, find patterns, and formulate hypotheses.
*   **Data storytelling** is the skill to craft a narrative by leveraging data, which is then contextualized, and finally presented to an audience. It utilizes data analysis, statistics, data visualization, qualitative and contextual analysis, and presentation. A data story is a narrative constructed around a set of data that puts it into context and frames the broader implications.
*   Data visualizations are an essential part of data stories that help deliver various points in a narrative.
*   **Data quality** should be addressed for each individual measurement, each individual observation, and for the entire data set.
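
As a rough illustration of EDA and a basic data quality check, the sketch below summarizes one variable using only the Python standard library. The `heights_cm` values and the `summarize` helper are made up for this example and are not part of the course material.

```python
import statistics

# Hypothetical measurements; None marks a missing value.
heights_cm = [172.0, 165.5, None, 180.2, 158.9, 171.3, None, 169.0]

def summarize(values):
    """Return basic summary figures for a list that may contain None."""
    present = [v for v in values if v is not None]
    q1, q2, q3 = statistics.quantiles(present, n=4)  # quartile cut points
    return {
        "count": len(values),
        "missing": len(values) - len(present),  # data quality: missing values
        "min": min(present),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(present),
        "mean": statistics.fmean(present),
    }

summary = summarize(heights_cm)
for name, value in summary.items():
    print(f"{name:>8}: {value}")
```

A summary like this is usually a first step before plotting; the count of missing values is one simple signal of data quality for the whole data set.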

**Key Statistical Concepts**
*   **Population:** the source of data to be collected.
*   **Sample:** a portion of the population.
*   **Variable:** any data item that can be measured or counted.
*   **Descriptive statistics** summarize the characteristics of a data set, whether it is a sample or a whole population.
*   **Inferential statistics** use a sample to draw conclusions and make predictions about a population.
*   **Measures of central tendency** include the mean (average), median (central value), and mode (most frequent value).
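
The measures of central tendency above can be sketched with Python's built-in `statistics` module. The exam scores are invented for illustration, and taking every other score is only one simple way to form a sample.

```python
import statistics

# Hypothetical population: exam scores for every student in a small class.
population = [55, 61, 67, 70, 70, 72, 75, 78, 84, 98]

# A sample is a portion of the population (here, every other score).
sample = population[::2]

pop_mean = statistics.mean(population)      # average
pop_median = statistics.median(population)  # central value
pop_mode = statistics.mode(population)      # most frequent value
sample_mean = statistics.mean(sample)

print("population mean  :", pop_mean)
print("population median:", pop_median)
print("population mode  :", pop_mode)
print("sample mean      :", sample_mean)
```

The sample mean is close to, but not the same as, the population mean, which is the gap that inferential statistics tries to reason about.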

**Feature Engineering and Selection**
*   **Feature engineering** involves creating and transforming features to improve model performance. It consists of feature creation, transformation, and extraction, along with exploratory data analysis.
*   **Feature selection** is a process of selecting the most relevant features for a model.
*   Dimensionality reduction is a technique used to reduce the number of features in a dataset, improving its comprehensibility.
*   Tools like FeatureTools, TsFresh, and OneBM can be used for feature engineering.
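
The following is a minimal sketch of feature creation and feature transformation in plain Python. The records, the field names, and the helpers `min_max_scale` and `one_hot` are assumptions made for this example; dedicated libraries offer more complete versions of the same ideas.

```python
# Hypothetical raw records; the field names are illustrative only.
rows = [
    {"height_m": 1.60, "weight_kg": 55.0, "city": "Austin"},
    {"height_m": 1.75, "weight_kg": 80.0, "city": "Dallas"},
    {"height_m": 1.90, "weight_kg": 95.0, "city": "Austin"},
]

def min_max_scale(values):
    """Feature transformation: rescale values linearly to the range [0, 1].

    Assumes the values are not all equal (otherwise the range is zero).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Feature transformation: one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

# Feature creation: derive body-mass index from two existing columns.
bmi = [r["weight_kg"] / r["height_m"] ** 2 for r in rows]

scaled_weight = min_max_scale([r["weight_kg"] for r in rows])
city_flags = one_hot([r["city"] for r in rows])

features = [
    {"bmi": round(b, 1), "weight_scaled": w, **flags}
    for b, w, flags in zip(bmi, scaled_weight, city_flags)
]
for f in features:
    print(f)
```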

**Machine Learning**
*   **Classification** is the activity of assigning objects to pre-existing classes or categories.
*   **Regression analysis** is a set of statistical processes for estimating the relationships among variables.
*   **Supervised learning** involves training models on labeled data, whereas **unsupervised learning** works with unlabeled data.
*   Common machine learning algorithms include Linear Regression, Logistic Regression, Decision Trees, Naive Bayes, Random Forest, Support Vector Machines, K-Means, K-Nearest Neighbors, Dimensionality Reduction, and Artificial Neural Networks.
*   **Naive Bayes** classifiers assign labels based on probability.
*   **Support Vector Machines (SVM)** are used for classification and regression; for classification, they look for the hyperplane that best separates the classes in a dataset.
*   **Large language models (LLMs)** can be used in data science to understand user queries, generate code, and enhance the interpretability of predictive AI models.
*   **Fine-tuning** is a technique used to adapt pre-trained LLMs to specific tasks, incorporating proprietary data.
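
To make supervised classification concrete, here is a small K-Nearest Neighbors sketch written with the standard library only. The training points, the labels, and the `knn_predict` helper are hypothetical; in practice a library such as Scikit-learn would normally be used instead.

```python
import math
from collections import Counter

# Hypothetical labeled training data: (feature vector, class label).
training = [
    ((1.0, 1.2), "small"),
    ((1.5, 0.9), "small"),
    ((1.1, 0.8), "small"),
    ((5.0, 5.5), "large"),
    ((5.8, 4.9), "large"),
    ((6.1, 5.2), "large"),
]

def knn_predict(point, labeled_points, k=3):
    """Classify `point` by majority vote among its k nearest labeled neighbors."""
    by_distance = sorted(labeled_points, key=lambda item: math.dist(point, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((1.3, 1.0), training))
print(knn_predict((5.5, 5.0), training))
```

The labels in `training` are what make this supervised: the model assigns new points to pre-existing classes. An unsupervised method such as K-Means would receive only the feature vectors and would have to find the groups itself.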

**Python Libraries**
*   **Pandas** is a library for structured data operations such as importing CSV files, creating dataframes, and data preparation.
*   **NumPy** is a mathematical library for arrays, linear algebra, and Fourier transforms. NumPy is used for numerical analysis, array manipulation, descriptive statistics, and it forms the basis of other libraries.
*   **Matplotlib** is a library for creating data visualizations such as charts and graphs. It has diverse functions for creating line plots, scatter plots, histograms, pie charts, and box plots.
*   **Seaborn** is a statistical data visualization library built on Matplotlib that offers simplicity and unique features. Seaborn is better integrated with Pandas data frames.
*   **SciPy** is a library for complex mathematical calculations and scientific problems, and it also includes linear algebra modules.
*   **Scikit-learn** is a machine learning library built on NumPy and SciPy, with access to a variety of algorithms and statistical models. It is used to create visualizations based on machine learning models and for predictive analytics.
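
A brief sketch of how Pandas and NumPy fit together is shown below. It assumes both libraries are installed (as they are in Colab); the CSV text is made up and stands in for a file on disk. Plotting with Matplotlib or Seaborn is left out to keep the example short.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """name,hours_studied,score
Ana,2,65
Ben,4,78
Cy,,70
Dee,8,92
"""

# Pandas: import CSV data into a DataFrame and prepare it.
df = pd.read_csv(io.StringIO(csv_text))
median_hours = df["hours_studied"].median()  # median ignores the missing value
df["hours_studied"] = df["hours_studied"].fillna(median_hours)

# NumPy: array-based descriptive statistics on a DataFrame column.
scores = df["score"].to_numpy()
print(df)
print("mean score:", np.mean(scores))
print("std of scores:", np.std(scores))
```

Filling the missing value with the median is only one possible preparation choice; dropping the row or using another estimate may suit other data better.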

**Other Important Concepts**
*   **Data cleansing** is the process of correcting or removing inaccurate or inconsistent data.
*   **Survivorship bias** is a type of selection bias where only the successful outcomes of a process are visible.
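
Data cleansing can be sketched as a small function that corrects inconsistent values and removes rows that cannot be trusted. The records, the `COUNTRY_NAMES` mapping, and the rules inside `clean` are assumptions chosen for this example; real cleansing rules depend on the data set.

```python
# Hypothetical raw records with typical quality problems.
raw = [
    {"id": 1, "age": "34", "country": " us "},
    {"id": 2, "age": "-5", "country": "US"},       # impossible age
    {"id": 3, "age": "n/a", "country": "Canada"},  # unparseable age
    {"id": 1, "age": "34", "country": "US"},       # duplicate id
    {"id": 4, "age": "51", "country": "canada"},
]

COUNTRY_NAMES = {"us": "US", "canada": "Canada"}  # assumed canonical spellings

def clean(records):
    """Correct inconsistent values and remove inaccurate or duplicate rows."""
    seen_ids = set()
    cleaned = []
    for rec in records:
        if rec["id"] in seen_ids:
            continue  # remove duplicates
        if not rec["age"].isdigit():
            continue  # remove rows whose age is not a non-negative integer
        key = rec["country"].strip().lower()
        country = COUNTRY_NAMES.get(key, rec["country"].strip())
        cleaned.append({"id": rec["id"], "age": int(rec["age"]), "country": country})
        seen_ids.add(rec["id"])
    return cleaned

cleaned = clean(raw)
for rec in cleaned:
    print(rec)
```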

This summary highlights the main themes and tools discussed in Week 01 - Introduction, offering a broad view of data science and related fields.


## Data

From Wikipedia (https://en.m.wikipedia.org/wiki/Data):
* A set of values of qualitative or quantitative variables about one or more persons or objects
* A datum (singular of data) is a single value of a single variable
* Although the terms "data" and "information" are often used interchangeably, data are sometimes said to be transformed into information when they are viewed in context or in post-analysis
* Data are measured, collected, reported, and analyzed, and used to create data visualizations such as graphs, tables or images
* Data as a general concept refers to the fact that some existing information or knowledge is represented or coded in some form suitable for better usage or processing

## Science

From Wikipedia (https://en.m.wikipedia.org/wiki/Science):
* Modern science is commonly divided into three major branches: natural science, social science, and formal science
* Logic, mathematics, statistics, and computer science are listed under formal science
* Formal science is an area of study that generates knowledge using formal systems
* Formal science is a priori: its knowledge is independent of experience
* Science involves using the scientific method, which seeks to objectively explain the events of nature in a reproducible way
* The steps listed for the scientific method vary from text to text but usually include: a) define the problem, b) gather background information, c) form a hypothesis, d) make observations, e) test the hypothesis, and f) draw conclusions (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1635141)

## Data Science

From Wikipedia (https://en.m.wikipedia.org/wiki/Data_science):
* Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains
* Core skills include SQL, data management, programming, and statistics

## Domain
* Data Analysis
* Machine Learning
* Artificial Intelligence
* Math
* Statistics
* Computer Science
* Information Technology
* Business Knowledge

## Some Tools

* R
* SPSS
* SAS
* Excel
* Power BI
* Tableau
* Python (Colab)
* SQL

## Statistical Learning vs Machine Learning

<table>
<tr>
<th>Statistical Learning</th>
<th>Machine Learning</th>
</tr>
<tr>
<td style='text-align: left;'>
<ul>
<li>Statistical significance / tests</li>
<li>Strong explanatory power</li>
<li>ANOVA, t-test</li>
<li>Model diagnostics, model building </li>
<li>Model selection (forward selection, backward selection, stepwise selection)</li>
<li>Evaluation (AIC, BIC)</li>
<li>Standard transformations (e.g. Box-Cox)</li>
<li>Residual analysis</li>
<li>Generalized linear models (GLM)</li>
<li>Longitudinal data, time series</li>
<li>Experimental design</li>
<li>Covariates, predictors, outcomes, independent / dependent variables</li>
<li>Normal distribution</li>
<li>Interpretation</li>
<li>Regression Analysis </li>
</ul>          
</td>    
<td style='text-align: left;'>
<ul>
<li>Loss function</li>
<li>Minimizing the loss function</li>
<li>Train and test sets</li>
<li>High predictive accuracy</li>
<li>Hyperparameters, generalization, overfitting, regularization</li>
<li>Inputs and outputs (targets)</li>
<li>Gaussian distribution</li>
<li>Implementation</li>
<li>Linear Regression</li>
<li>Calculus</li>
<li>Python</li>
<li>Linear algebra (vectors, matrices)</li>
<li>Probability</li>
</ul>        
</td>
</tr>
</table>
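
The machine learning column mentions a loss function and minimizing it. One way to picture this is gradient descent on the mean squared error of a straight line, sketched below in plain Python. The data, the learning rate, and the number of steps are arbitrary choices made for this illustration.

```python
# Hypothetical data generated from y = 2x + 1 (no noise), for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def mse(w, b):
    """Loss function: mean squared error of the line y = w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(w, b):
    """Partial derivatives of the MSE with respect to w and b."""
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db

w, b = 0.0, 0.0
learning_rate = 0.05  # a hyperparameter, chosen by hand here
for _ in range(2000):
    dw, db = gradient(w, b)
    w -= learning_rate * dw  # step against the gradient to reduce the loss
    b -= learning_rate * db

print(f"w = {w:.3f}, b = {b:.3f}, loss = {mse(w, b):.6f}")
```

A statistical treatment of the same line would typically solve for the coefficients directly and then examine residuals and significance, which reflects the difference in emphasis between the two columns.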

## Data Science Process

https://www.springboard.com/blog/wp-content/uploads/2022/05/data-science-life-cycle.png

* Understanding the problem and getting the data
* Data preparation and exploratory data analysis
* Feature engineering
* Feature selection
* Model selection
* Model training
* Model testing
* Model tuning
* Reporting

## Statistics

* Drawing a sample from a population and inferring characteristics from that sample to the population
* A statistic describes a sample, a parameter describes a population
* Data is a human artifact
* Limited in capacity to describe the world
* Limited because of variability
* Much is based on summary data - the average and deviations
* Airplane seats and airplane bullet holes
* https://dlm-econometrics.blogspot.com/2020/04/the-average-man.html
* https://en.wikipedia.org/wiki/Survivorship_bias
* Visualizations and storytelling
* Distinguishing between chance and pattern
* https://en.wikipedia.org/wiki/Statistics
* https://en.wikipedia.org/wiki/History_of_statistics
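
Two ideas from this list, the statistic versus the parameter and survivorship bias, can be simulated in a few lines. The fund returns are randomly generated and the -5% cutoff is an invented rule, so the numbers only illustrate the pattern.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: yearly returns (%) of 10,000 funds.
population = [rng.gauss(5.0, 10.0) for _ in range(10_000)]
parameter = statistics.fmean(population)  # a parameter describes the population

# A random sample gives a statistic that estimates the parameter.
sample = rng.sample(population, 500)
statistic = statistics.fmean(sample)  # a statistic describes the sample

# Survivorship bias: funds that lost more than 5% closed and left the data.
survivors = [r for r in population if r > -5.0]
biased = statistics.fmean(survivors)

print(f"population mean (parameter): {parameter:.2f}")
print(f"random-sample mean (statistic): {statistic:.2f}")
print(f"survivors-only mean: {biased:.2f}")
```

The random sample lands near the population mean, while the survivors-only average is pushed upward because the worst outcomes are no longer visible.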



## History Brief: 1700s - 1800s

* **1700:** Isaac Newton introduces an early form of linear regression analysis while studying equinoxes. He averages data, sums residuals to zero, and distinguishes between different types of data.
* **1731:** Development of the sextant improves navigation and mapping.
* **Mid-1700s:**  Taking the arithmetic mean of measurements becomes common practice in astronomy and navigation.
* **1748-1750:** Tobias Mayer uses lunar observations to determine the moon's libration and devises a method for combining inconsistent equations.
* **1749:** Leonhard Euler writes about inequalities in the motion of Saturn and Jupiter.
* **1750:** Mayer introduces the symbol ±x for errors in measurement.
* **1755:**  Christopher Maire and Roger Joseph Boscovich publish results on measuring a meridian arc.
* **1763:** James Short develops a method for averaging measurements that discounts outliers.
* **1773:** Lambert observes a reversal in the retardation of Saturn's motion.
* **1787:** Laplace extends Mayer's method for reconciling inconsistent equations.
* **1789:** Laplace adds an analytical framework to Boscovich's work, calling it the "Method of Situation."
* **1792:** Legendre joins the French commission for measuring the length of a meridian quadrant.
* **1795:** A French commission measures the meridian arc from Barcelona to Dunkirk. Gauss claims to have been using the method of least squares.
* **Late 1700s:** Scientists shift their view towards combining observations made under different conditions for comparing theory and experience.
* **1805:** Legendre publishes the method of least squares in "Nouvelles méthodes pour la détermination des orbites des comètes."
* **1809:** Gauss also publishes the method of least squares.
* **1821:** Gauss further develops the theory of least squares, including a version of the Gauss-Markov theorem.
* **1820s:** The method of least squares becomes a standard tool in astronomy and geodesy.
* **1831:** Mary Somerville describes Laplace's method as an alternative to least squares.
* **19th Century:** Francis Galton coins the term "regression" to describe a biological phenomenon.



## Sources

Data Science and Statistics, Probability, and Distributions
* https://catalog.yale.edu/ycps/subjects-of-instruction/statistics/
* https://datascience.virginia.edu/news/how-much-do-data-scientists-need-know-about-statistics
* https://www.seldon.io/supervised-vs-unsupervised-learning-explained
* https://probability4datascience.com/
* https://www.institutedata.com/us/blog/what-is-probability-theory-in-data-science/
* https://blog.dailydoseofds.com/p/11-essential-distributions-that-data
* https://ori.hhs.gov/education/products/n_illinois_u/datamanagement/dctopic.html

Data Science Process
* https://www.springboard.com/blog/wp-content/uploads/2022/05/data-science-life-cycle.png
* https://www.institutedata.com/us/blog/5-steps-in-data-science-lifecycle/
* https://en.wikipedia.org/wiki/Data_cleansing
* https://en.wikipedia.org/wiki/Exploratory_data_analysis
* https://builtin.com/articles/feature-engineering
* https://www.heavy.ai/technical-glossary/feature-selection
* https://en.wikipedia.org/wiki/Regression_analysis
* https://www.nobledesktop.com/classes-near-me/blog/top-algorithms-for-data-science
* https://en.wikipedia.org/wiki/Classification
* https://www.ibm.com/think/topics/fine-tuning
* https://www.thoughtspot.com/data-trends/best-practices/data-storytelling

Data Science and Python
* https://www.nobledesktop.com/classes-near-me/blog/why-learn-numpy-for-data-science
* https://www.nvidia.com/en-us/glossary/pandas-python/
* https://www.nobledesktop.com/classes-near-me/blog/why-learn-matplotlib-for-data-science
* https://datascientest.com/en/seaborn-everything-you-need-to-know-about-the-python-data-visualization-tool
* https://www.nobledesktop.com/classes-near-me/blog/why-learn-scikit-learn-for-data-science
* https://datascientest.com/en/scipy-all-about-the-python-machine-learning-library




## Books

* Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are by Seth Stephens-Davidowitz
* Naked Statistics: Stripping the Dread from the Data by Charles Wheelan
* Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy by Cathy O'Neil
* The Art of Statistics: How to Learn from Data by David Spiegelhalter
* The Drunkard's Walk: How Randomness Rules Our Lives by Leonard Mlodinow
* The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World by Pedro Domingos
* The Book of Why by Judea Pearl and Dana MacKenzie
* The History of Statistics by Stephen Stigler
* The End of Average by Todd Rose
* Outliers by Malcolm Gladwell
* Freakonomics by Stephen J. Dubner and Steven Levitt
* Thinking in Bets by Annie Duke
* The Signal and the Noise by Nate Silver
* Data Science for Business by Foster Provost and Tom Fawcett
* Storytelling with Data by Cole Nussbaumer Knaflic

## Data Science and Python Online Books

* https://www.statlearning.com/ (all lectures on YouTube)
* https://jakevdp.github.io/PythonDataScienceHandbook/