## 1.1: Introduction to Data Science

### A History of Data Science

Data science is a discipline that incorporates varying degrees of Data Engineering, Scientific Method, Math, Statistics, Advanced Computing, Visualization, Hacker mindset, and Domain Expertise. A practitioner of Data Science is called a Data Scientist. Data Scientists solve complex data analysis problems.

### Origins
The term "Data Science" was coined at the beginning of the 21st Century. It is attributed to William S. Cleveland who, in 2001, wrote "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics". About a year later, the International Council for Science: Committee on Data for Science and Technology started publishing the CODATA Data Science Journal beginning in April 2002. Shortly thereafter, in January 2003, Columbia University began publishing The Journal of Data Science.



### Development
During the "dot-com" bubble of 1998-2000, hard drives became really cheap. So corporations and governments started buying lots of them. One corollary of Parkinson's Law is that data always expands to fill the disk space available. The "disk-data" interaction is a positive exponential cycle between buying ever more disks and accumulating ever more data. This cycle produces big data. Big data is a term used to describe data sets so large and complex that they become awkward to work with using regular database management tools.

Once big data has been acquired, we have to do something with it besides just storing it. We need big computing architectures. Companies like Google, Yahoo!, and Amazon invented a new computing architecture, which we call cloud computing. One of the most important inventions within cloud computing is called MapReduce. MapReduce has been codified into the software known as Hadoop. We use Hadoop to do big computing on big data in the cloud.

The normal computing paradigm is that we move data to the algorithm. For example, we read data off a hard drive and load it into a spreadsheet program to process. The MapReduce computing paradigm is just the opposite. The data are so big that we cannot move them all to the algorithm. Instead, we push many copies of the algorithm out to the data.

It turns out that Hadoop is difficult to use. It requires advanced computer science capabilities. This opens up a market for analytics tools, with simpler interfaces, that run on top of Hadoop. This class of tools is called "Mass Analytic Tools", that is, tools for the analysis of massive data. Examples of these are "recommender systems", "machine learning", and "complex event processing". These tools, while having a simpler interface to Hadoop, have complex mathematical underpinnings, which also require specialization.
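
The idea of pushing the algorithm out to the data can be illustrated with a toy word count in plain Python. This is only a single-machine sketch of the map, shuffle, and reduce steps, not Hadoop itself; the `partitions` list and the function names are hypothetical, with each partition standing in for a block of text that would live on a separate machine.

```python
from collections import defaultdict

# Hypothetical data: each inner list stands in for text stored on a different machine.
partitions = [
    ["big data needs big computing"],
    ["data science needs data"],
]

def map_phase(lines):
    """Runs next to one partition: emit a (word, 1) pair for every word."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(mapped_outputs):
    """Group the emitted counts by word across all partitions."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for word, count in pairs:
            groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Combine the grouped counts for each word into a single total."""
    return {word: sum(counts) for word, counts in groups.items()}

# One copy of the map function is applied to each partition independently.
mapped = [map_phase(partition) for partition in partitions]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

In a real cluster the map step would run in parallel on the machines that hold the data, and only the small intermediate pairs would travel over the network.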

So, with the advent of mass analytic tools, we need people to understand the tools and actually do the analysis of big data. We call these people "Data Scientists". These people are able to tease out new analytic insights never before possible in the world of small data. The scale of problems that are solved by analyzing big data is such that no single person can do all the data processing and analytic synthesis required. Therefore, data science is best practiced in teams.

In sum, cheap disks --> big data --> cloud computing --> mass analytic tools --> data scientists --> data science teams --> new analytic insights.



### Popularization
Mike Loukides, Vice President of Content Strategy for O'Reilly Media, helped to bring data science into the mainstream vernacular in 2010 with his article "What is data science?" In the last few years, data science has become increasingly associated with the analysis of big data. In the mid-2000s, DJ Patil at LinkedIn and Jeff Hammerbacher at Facebook created data science teams specifically to derive business value from the extremely large volumes of data being generated by their websites.

There are now several ongoing conferences devoted to big data and data science, such as O'Reilly's Strata Conferences and Greenplum's Data Science Summits.

The job title has similarly become very popular. On one heavily used employment site, the number of job postings for "data scientist" increased more than 10,000 percent between January 2010 and July 2012.

 

### Academic Programs
Several universities have begun graduate programs in data science, including the Institute for Advanced Analytics at North Carolina State University and the McCormick School of Engineering at Northwestern University. The University of Illinois also offered a six-week summer program, which has since been discontinued.

 

### Professional Organizations
A few professional organizations have sprung up recently. Data Science Central and Kaggle are two such examples. Kaggle is an interesting case: it crowdsources data science solutions to difficult problems. For example, a company will post a hard problem on Kaggle. Data scientists from around the world sign up with Kaggle, then compete with each other to find the best solution. The company then pays for the best solution. There are over 30,000 data scientists registered with Kaggle.

 

### Case Study
In the mid- to late-1990s, AltaVista was the most popular search engine on the internet. It sent "crawlers" to extract the text from all the pages on the web. The crawlers brought the text back to AltaVista, which indexed all of it. So, when a person searched for a keyword, AltaVista could find the web pages that contained that word. AltaVista then presented the results as an ordered list of web pages, with the pages that mentioned the term most frequently at the top. This is a straightforward computer science solution, though, at the time, AltaVista had to solve some very difficult scaling problems.
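
The keyword approach described above can be sketched with a small inverted index. This is a minimal illustration under stated assumptions, not AltaVista's actual implementation: the `pages` dictionary, its URLs, and the function names are hypothetical, and the ranking is a plain count of how often the keyword appears on each page.

```python
from collections import Counter, defaultdict

# Hypothetical crawl results: URL -> text extracted from the page.
pages = {
    "a.example": "python data python tools",
    "b.example": "data science with python",
    "c.example": "cooking recipes",
}

def build_index(pages):
    """Map each word to a Counter of {url: number of times the word appears there}."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] += 1
    return index

def search(index, keyword):
    """Return the URLs that contain the keyword, most frequent mentions first."""
    hits = index.get(keyword.lower(), Counter())
    return [url for url, _ in hits.most_common()]

index = build_index(pages)
print(search(index, "python"))
```

Because the index is built once and then reused for every query, a search only has to look up one word rather than rescan every page.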

In the late 1990s, the founders of Google invented a different way to do searches. They combined math, statistics, data engineering, advanced computation, and the hacker spirit to create a search engine that displaced AltaVista. The algorithm is known as PageRank. PageRank looks not only at the words on the page but also at the hyperlinks. PageRank assumes that an inbound hyperlink is an indicator that some other person thought the current page was important enough to link to it from their own page. Thus the pages with the most inbound hyperlinks end up at the top of the list of search results. PageRank captures human knowledge about web pages, in addition to the content.
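
To make the link-based idea concrete, here is a small sketch of PageRank as it is commonly described: an iterative calculation in which each page shares its current rank among the pages it links to, so a link from a highly ranked page counts for more than a link from an obscure one. That weighting goes a step beyond the simple link count described above. The `links` graph, the damping factor of 0.85, and the fixed iteration count are illustrative assumptions, not details taken from this text.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page ranks; `links` maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal rank everywhere
    for _ in range(iterations):
        # Every page receives a small baseline share regardless of its links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            # A page with no outbound links spreads its rank over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page web: "home" has the most inbound links, "orphan" has none.
links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}

ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the page with the most inbound links comes out on top and the page nobody links to comes out last, which matches the intuition in the paragraph above.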

Google not only crawled the web, it ingested the web. That is big data. Google then had to run the PageRank algorithm across that big data, which requires massive computation. Finally, it had to make search and search results fast for everyone. Google search is a triumph of data science (though it was not called data science when it started).




## Understanding Data Science

The field of data science is quite diverse. Before getting into the technical details of the course, it is important to gain some perspective on how the pieces fit together. As you go through this section, remember that we are driving toward the nexus of coding implementations (in Python) for data analysis and modeling. As the course progresses, Python implementations will require a mixture of mathematical and visualization techniques. For now, use this introduction to order your understanding of the field. Watch the first 1 minute and 40 seconds of this video.

[Video](https://youtu.be/-L46TIh6Ctc)

## 1.2: How Data Science Works

As you immerse yourself in this introductory phase of the course, you will transition from a qualitative understanding of concepts to a more quantitative understanding. This present step involves seeing examples of what real data looks like, how it is formatted, and various approaches for dealing with analyses using mathematics and visualization. If this section is truly doing its job, you should ask yourself:

[Video](https://youtu.be/HNC4yjeZHC0)

## The Data Science Pipeline

Now that you have some terminology and methods under your belt, we can begin to put together an understanding of a typical data science pipeline from beginning to end. Data usually arrives in raw form, so it must first be curated and prepared. This is the process of data engineering. Next, data analysis techniques such as visualization and statistical analysis should give some sense of what relationships exist within the data. The step after that is to derive a model for the data (for example, by building statistical models or applying machine learning). This process is repeated and refined until quantifiable measures of success have been met.
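
The stages above can be sketched end to end in a few lines of plain Python. Everything here is a hypothetical illustration: the `raw` records, the `hours` and `score` field names, the choice of a least-squares line as the model, and the 0.95 target for R² are assumptions chosen to keep the example small.

```python
# Hypothetical raw records: values arrive as strings, and some are missing or malformed.
raw = [
    {"hours": "1", "score": "52"},
    {"hours": "2", "score": "61"},
    {"hours": "", "score": "70"},
    {"hours": "4", "score": "79"},
    {"hours": "5", "score": "n/a"},
    {"hours": "6", "score": "98"},
]

def prepare(records):
    """Data engineering: keep only the records whose fields parse as numbers."""
    rows = []
    for record in records:
        try:
            rows.append((float(record["hours"]), float(record["score"])))
        except ValueError:
            continue  # skip records with missing or malformed values
    return rows

def mean(values):
    return sum(values) / len(values)

def fit_line(rows):
    """Modeling: ordinary least-squares fit of score = slope * hours + intercept."""
    x_bar = mean([x for x, _ in rows])
    y_bar = mean([y for _, y in rows])
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in rows)
             / sum((x - x_bar) ** 2 for x, _ in rows))
    return slope, y_bar - slope * x_bar

def r_squared(rows, slope, intercept):
    """Evaluation: fraction of the variance in score explained by the fitted line."""
    y_bar = mean([y for _, y in rows])
    residual = sum((y - (slope * x + intercept)) ** 2 for x, y in rows)
    total = sum((y - y_bar) ** 2 for _, y in rows)
    return 1.0 - residual / total

rows = prepare(raw)
print("usable rows:", len(rows), "mean score:", mean([y for _, y in rows]))  # quick look at the data
slope, intercept = fit_line(rows)
r2 = r_squared(rows, slope, intercept)
meets_target = r2 >= 0.95  # hypothetical success measure; refine the steps above if it fails
print("slope:", round(slope, 2), "R^2 meets target:", meets_target)
```

In practice each stage is far richer (plots instead of a single summary line, several candidate models instead of one), but the loop of prepare, explore, model, and evaluate has the same shape.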

