From a9256f3009cc1a565d31c7ab8b0cfff5749d2a9d Mon Sep 17 00:00:00 2001 From: Manikantagit Date: Tue, 21 Sep 2021 16:30:44 +0530 Subject: [PATCH] 334-planetary-computer.txt--updated --- transcripts/334-planetary-computer.txt | 207 +++++++++++++------ 1 file changed, 108 insertions(+), 99 deletions(-) diff --git a/transcripts/334-planetary-computer.txt b/transcripts/334-planetary-computer.txt index f70ba2b3..85893f0b 100644 --- a/transcripts/334-planetary-computer.txt +++ b/transcripts/334-planetary-computer.txt @@ -1,16 +1,25 @@ -00:00:00 On this episode. Rob Emanuel and Tom Ox Berger join us to Talk about building and running Microsoft's Planetary Computer project. This project is dedicated to providing the data around climate records and the compute necessary to process it with the mission of helping us all understand climate change better. It combines multiple petabytes of data with a powerful hosted Jupiter Lab notebook environment to process it. This is Talk Python. My episode 334, recorded September 2021.

-00:00:43 Welcome to Talk Python, a weekly podcast on Python. This is your host, Michael Kitty. Follow me on Twitter where I'm at M. Kennedy and keep up with a show and listen to past episodes at Talk Python FM and follow the show on Twitter via at Talk Python. We've started streaming most of our episodes live on YouTube, subscribe to our YouTube channel over at Talk Python FM YouTube to get notified about upcoming shows and be part of that episode. This episode is brought to you by Shortcut, formerly known as Clubhouse IO and us over at Talk Python training and the transcripts are brought to you by assembly AI Rob Tom.

-00:01:21 Welcome to Talk Python.
-00:01:22 Me.

+00:00:00 On this episode, Rob Emanuel and Tom Augspurger join us to talk about building and running Microsoft's Planetary Computer project. This project is dedicated to providing the data around climate records and the compute necessary to process it, with the mission of helping us all understand climate change better. It combines multiple petabytes of data with a powerful hosted JupyterLab notebook environment to process it. This is Talk Python to Me episode 334, recorded September 9th, 2021.
+
+00:00:43 Welcome to Talk Python to Me, a weekly podcast on Python. This is your host, Michael Kennedy. Follow me on Twitter where I'm '@mkennedy', keep up with the show and listen to past episodes at 'talkpython.fm', and follow the show on Twitter via '@talkpython'. We've started streaming most of our episodes live on YouTube. Subscribe to our YouTube channel over at 'talkpython.fm/youtube' to get notified about upcoming shows and be part of that episode. This episode is brought to you by Shortcut, formerly known as Clubhouse.IO, and us over at Talk Python Training, and the transcripts are brought to you by 'Assembly AI'. Rob, Tom.
+
+00:01:21 Welcome to Talk Python to Me.

00:01:22 Thank you.

00:01:23 Good to have you both here. We get to combine a bunch of fun topics and important topics.

-00:01:29 Data science, Python, the cloud, big data, as in physically lots of data to deal with and then also climate change and being Proactive about studying that, make predictions and do science on huge amounts of data for sure. 
+00:01:29 Data science, Python, the cloud, big data, as in physically lots of data to deal with, and then also climate change and being proactive about studying that, making predictions and doing science on huge amounts of data for sure.

00:01:43 Look forward to it.

@@ -20,11 +29,11 @@

00:01:51 Yeah, sure.

-00:01:52 Been a developer for I don't know. Let's say, 14 years. I started at a shop that was doing side based power builder that goes back way back aways. And I actually come from a math background, so I didn't know a lot about programming and started using Python, just sort of like on the side to are some bank statements and do some personal stuff and started actually integrating some of our source control at the company with Python and had to write some C extensions. So got into the Python source code and started reading that code and being like, oh, this is how programming should work. Like, this is really good code. And that year went to my first Python.

+00:01:52 Been a developer for, I don't know, let's say 14 years. I started at a shop that was doing Sybase PowerBuilder, and that goes back aways. And I actually come from a math background, so I didn't know a lot about programming, and started using Python just sort of on the side to parse some bank statements and do some personal stuff, and started actually integrating some of our source control at the company with Python and had to write some C extensions. So I got into the Python source code and started reading that code and being like, oh, this is how programming should work. Like, this is really good code. And that year I went to my first PyCon.

-00:02:34 It was just like, all in I need to get a different job where I'm not doing Power Builder and really kind of credit Python and the code base setting me on a better development path for sure.

+00:02:34 I was just like, all in. I need to get a different job where I'm not doing PowerBuilder. And I really kind of credit PyCon and the code base with setting me on a better development path for sure.

-00:02:45 Oh, that's super cool. Python is a fun experience, isn't it? Yeah, it's like my geek holiday, but sadly, the key holiday has been canceled the last two years.

+00:02:45 Oh, that's super cool. PyCon is a fun experience, isn't it? Yeah, it's like my geek holiday, but sadly, the geek holiday has been canceled the last two years.

00:02:54 Oh, no.

@@ -34,9 +43,9 @@

00:02:57 Kind of similar to a lot of your guests, I think. I was in grad school and had to pick up programming for research and simulations. This was for economics.

-00:03:07 They started us on MATLAB and Fortran, and it goes back to maybe further, almost as far as you can go. And anyway, I didn't really care for MATLAB, so moved over to Python pretty quickly and then just started enjoying the data analysis side more than the research side and got into that whole open source ecosystem around Pandas and stats models in Econometric library, started contributing to open source, dropped out, got a job in data science stuff, and then moved on to Anaconda, where I worked on open source libraries like Pandas and Ask for a few years. Yeah.

+00:03:07 They started us on MATLAB and Fortran, and it goes back maybe further, almost as far as you can go. 
And anyway, I didn't really care for MATLAB, so moved over to Python pretty quickly and then just started enjoying the data analysis side more than the research side and got into that whole open source ecosystem around Pandas and statsmodels, an econometrics library, started contributing to open source, dropped out, got a job in data science stuff, and then moved on to Anaconda, where I worked on open source libraries like Pandas and Dask for a few years. Yeah.

-00:03:44 In a weird turn of a coincidence. A weird coincidence. I was just the previous episode with Stan. See who you worked with over there.

+00:03:44 In a weird turn of a coincidence. A weird coincidence. I was just on the previous episode with Stan Siebert, who you worked with over there.

00:03:51 Right, had him here. Director of community innovation. And then, it was a great place to work at, really enjoyed it. And then came on to this team at Microsoft almost a year ago now, working on the planetary computer. Yeah.

@@ -48,19 +57,19 @@

00:04:19 Okay, well, it's hard to beat that, right. That's one of his, like, it takes up a whole room. A huge room. That's pretty fantastic. Awesome. All right. Well, what are you doing today? You're both on the planetary computer project. You're working at Microsoft. What are you doing there?

-00:04:34 Yeah. So we're on a pretty small team that's building out a planetary computer, which really is sort of three components, which is a data catalog hosting a lot petabytes petabytes of data, openly licensed satellite imagery and other data sets on Azure's Blob storage. We're building API and running API services that ETL the data encode metadata according to the stack specification, which we can get into later about that those data sets putting them into a Postgres database and then building API services on top of that. That's a lot of what I do is manage the the ETL pipelines and the APIs and then expose that data to users, environmental data scientists, and really anybody. It's just publicly accessible. And that's sort of my side. And then there's a compute platform which can talk about.

+00:04:34 Yeah. So we're on a pretty small team that's building out a planetary computer, which really is sort of three components. There's a data catalog hosting petabytes and petabytes of data, openly licensed satellite imagery and other data sets, on Azure's Blob storage. We're building and running API services that ETL the data and encode metadata according to the STAC specification, which we can get into later, for those data sets, putting them into a Postgres database and then building API services on top of that. A lot of what I do is manage the ETL pipelines and the APIs and then expose that data to users, environmental data scientists, and really anybody. It's just publicly accessible. And that's sort of my side. And then there's a compute platform, which Tom can talk about.

00:05:26 Yeah. So all this is in service of environmental sustainability. And so our primary users are people who know how to code, mostly in Python, but they're not developers.

00:05:37 And so we don't want them having to worry about things like Kubernetes or whatever to set up a distributed compute cluster.

-00:05:44 So that's where kind of the hub comes in. 
It's a place where users can go log in, get a nice, convenient computing platform built on top of Jupiter hub and ask where they can scale out to these really large workflows to do whatever analysis they need, produce whatever derived data sets they need for them to pass along to their decision makers and environmental sustainability.

+00:05:44 So that's where kind of the hub comes in. It's a place where users can go log in, get a nice, convenient computing platform built on top of JupyterHub and Dask, where they can scale out to these really large workflows to do whatever analysis they need, produce whatever derived data sets they need for them to pass along to their decision makers in environmental sustainability.

-00:06:07 It's super cool the platform Mailer building people who might have some Python skills, some data science skills, but not necessarily high end cloud programming, right. Handling lots of data, setting up clusters, all those kinds of things. You just push a button, end up in a notebook. The notebook is nearby.

+00:06:07 It's super cool, the platform you all are building, for people who might have some Python skills, some data science skills, but not necessarily high end cloud programming, right. Handling lots of data, setting up clusters, all those kinds of things. You just push a button, end up in a notebook. The notebook is nearby.

00:06:27 Petabytes of data. Right.

-00:06:28 Right. Exactly. So we'll talk a lot about cloud native computing, data analysis. And really, what that means is just putting the compute as close as the data as possible. So in the same as our region.

+00:06:28 Right. Exactly. So we'll talk a lot about cloud native computing, data analysis. And really, what that means is just putting the compute as close to the data as possible. So in the same Azure region.

00:06:40 So you just need a big hard drive.

@@ -72,17 +81,17 @@

00:06:46 Exactly.

-00:06:47 It is. Yeah. So super neat. Before we get into it, though, let's just maybe talk real briefly about Microsoft and the environment. This obviously is an initiative you are putting together to help client climate scientists study the climate and whatnot. But I was really excited to see last year that you announced that Microsoft will be carbon negative by 2030. Yeah, for sure.

+00:06:47 It is. Yeah. So super neat. Before we get into it, though, let's just maybe talk real briefly about Microsoft and the environment. This obviously is an initiative you are putting together to help climate scientists study the climate and whatnot. But I was really excited to see last year that you announced that Microsoft will be carbon negative by 2030. Yeah, for sure.

00:07:10 I mean, Microsoft, and prior to me joining Microsoft, I didn't know any of this, but Microsoft has been on the forefront of corporate efforts in environment and sustainability for a long time. And there's been an internal carbon tax that we place on business groups, where there's actual payments made based on how much carbon emission each business group creates.

-00:07:32 And that's been used to fund the environmental sustainability team and all these efforts. And that sort of culminated into these four focus areas and commitments that were announced in 2020. So Carbon's a big one, not just carbon negative by 2030, but by 2050, actually having removed more carbon than Microsoft has ever produced since its inception. And that's over scope one, scope two and scope three, which means accounting for downstream and upstream providers. 
And then there's a couple more focus areas around waste. So by 220 and 30, achieving zero waste and around water becoming water positive and ensuring accessibility to clean drinking and sanitation water for more than 1.5 million people. There's an ecosystem element two by 2025, protecting more land than we use, and then also creating a planetary computer, which is really using Azure resources in the effort to model, monitor, and ultimately manage Earth's natural systems. That's awesome.

+00:07:32 And that's been used to fund the environmental sustainability team and all these efforts. And that sort of culminated into these four focus areas and commitments that were announced in 2020. So carbon's a big one: not just carbon negative by 2030, but by 2050, actually having removed more carbon than Microsoft has ever produced since its inception. And that's over scope one, scope two and scope three, which means accounting for downstream and upstream providers. And then there's a couple more focus areas around waste, so by 2030, achieving zero waste, and around water, becoming water positive and ensuring accessibility to clean drinking and sanitation water for more than 1.5 million people. There's an ecosystem element too: by 2025, protecting more land than we use, and then also creating a planetary computer, which is really using Azure resources in the effort to model, monitor, and ultimately manage Earth's natural systems. That's awesome.

00:08:36 That's the part where you all come in, right?

00:08:38 Yeah.

-00:08:38 Exactly. The planet are computers in that ecosystem commitment, and that's what we're working towards.

+00:08:38 Exactly. The planetary computer is in that ecosystem commitment, and that's what we're working towards.

00:08:42 Yeah. Very cool.

@@ -98,7 +107,7 @@

00:09:04 It's a lot, right.

-00:09:05 Or 50 or something like that. Large data centers. I don't matter. But I read them all the time to yeah. Yeah. It's like constant. So that's a big deal. Super cool. Alright. Let's talk about this planetary computer. You told us a little bit about the motivation there, and it's made up of three parts, right.

+00:09:05 Over 50 or something like that. Large data centers. I don't remember. But I read about them all the time, so yeah. Yeah. It's like constant. So that's a big deal. Super cool. Alright. Let's talk about this planetary computer. You told us a little bit about the motivation there, and it's made up of three parts, right.

00:09:24 Alright.

@@ -116,9 +125,9 @@

00:10:45 Yeah. So Google Earth Engine is sort of the bar that's set as far as using cloud compute resources for Earth science.

-00:10:53 And it's an amazing platform that's been around for a long time. And it's really just like a giant compute cluster that has interfaces into an API and sort of like a JavaScript interface into it that you can run geospatial analytics. And so it's a great, like I said, a great tool can't sing it praises enough. One of the aspects of it that make it less useful in certain contexts is that it is a little bit of a black box, right. The operations, the geospatial operations that you can do on it, the way that you can manipulate the data are sort of whatever Google or Engine provides.

+00:10:53 And it's an amazing platform that's been around for a long time. And it's really just like a giant compute cluster that has interfaces into an API and sort of like a JavaScript interface into it that you can run geospatial analytics. And so it's, like I said, a great tool, can't sing its praises enough. 
One of the aspects of it that makes it less useful in certain contexts is that it is a little bit of a black box, right. The operations, the geospatial operations that you can do on it, the way that you can manipulate the data, are sort of whatever Google Earth Engine provides.

-00:11:30 If you wanted to run a Pi torch model against a large set of satellite imagery, that's a lot more difficult. You can't really do that inside a Google re engine, you have to ship data out and ship data in and getting data in and out of the system is a little a little tough because it's sort of a singular solution. And then you optimize a lot based on that. So the approach we're taking is more modular approach, leaning heavily on the open source ecosystem, the tools trying to, you know, make sure that the open source users are first class users that were thinking first, and that if people want to just use our data, we just have cloud. Ahmud, GeoTIFF, these flat file formats on Blob storage.

+00:11:30 If you wanted to run a PyTorch model against a large set of satellite imagery, that's a lot more difficult. You can't really do that inside Google Earth Engine; you have to ship data out and ship data in, and getting data in and out of the system is a little tough because it's sort of a singular solution. And then you optimize a lot based on that. So the approach we're taking is a more modular approach, leaning heavily on the open source ecosystem, the tools, trying to, you know, make sure that the open source users are first class users that we're thinking of first, and that if people want to just use our data, we just have cloud optimized GeoTIFFs, these flat file formats, on Blob storage.

00:12:09 Go ahead.

@@ -126,11 +135,11 @@

00:12:37 The current focus is really on that Python data science, but yeah, considering the open source ecosystem sort of as our user experience and trying to treat that as the first class use case.

-00:12:49 Yeah, that's fantastic. Tell me if I have this right. I feel like my limited experience working with this is you've got these incredible amounts of data, but they're super huge. You all built these APIs that let you ask questions and filter it down into. I just want the map data for this Polygon or whatever. And then you provide a Jupiter notebook and the compute to do stuff on that result. Is that pretty pretty good.

+00:12:49 Yeah, that's fantastic. Tom, tell me if I have this right. I feel like my limited experience working with this is you've got these incredible amounts of data, but they're super huge. You all built these APIs that let you ask questions and filter it down to: I just want the map data for this polygon or whatever. And then you provide a Jupyter notebook and the compute to do stuff on that result. Is that pretty good?

00:13:14 Yeah.

-00:13:14 That's pretty good if you just think like, the API is so crucial to have and we'll get into what it's built on. But just for, like, the Python analogy here is like, imagine that you only had lists for your data structure. You don't have dictionaries. And now you have to traverse this entire list of files to figure out where is this one, like in space on Earth? Where is it at or what time period is it covering? And the nice thing about the API is you're able to do very fast lookups over space and time with that to get down to your subset that you care about and then bring it into memory on ideally, on machines that are in the same Azure Eastern. 
Bring those data sets into memory using tools like Xray or Pandas and ask things like that.

+00:13:14 That's pretty good. If you just think like, the API is so crucial to have, and we'll get into what it's built on. But just for, like, the Python analogy here is like, imagine that you only had lists for your data structure. You don't have dictionaries. And now you have to traverse this entire list of files to figure out where is this one, like, in space on Earth? Where is it at, or what time period is it covering? And the nice thing about the API is you're able to do very fast lookups over space and time with that to get down to your subset that you care about and then bring it into memory, ideally on machines that are in the same Azure region, bringing those data sets into memory using tools like xarray or Pandas and Dask, things like that.

00:13:59 Yeah.

00:14:00 So, Robbie, you mentioned the Postgres database. Do you parse this data and generate the metadata and all that, and then store some of that information in the database so you get to it super quick? And then you've got the raw files as Blob storage, something like that.

-00:14:13 Yeah, for sure. I mean, that's as much metadata that you can cap and to describe the data so that you can kind of do what Tom said and, like, ignore the stuff that you don't care about and just get to the area that you care about. We try to extract that and we do that according to a spec that is really interesting, like Community German spec. That one of the biggest complaints about dealing with satellite imagery and this observation imagery is that it's kind of a mess. There's a lot of different scientific variables and sensor variables and things. So there's been a community effort over the past three or four years to develop specifications that make this type of information machine readable. And so we've kind of bought fully into that and have processes to look at the data, extract the stack metadata, which is just a JSON schema specification with extensions, and then write that into Postgres.

+00:14:13 Yeah, for sure. I mean, that's as much metadata as you can capture to describe the data, so that you can kind of do what Tom said and, like, ignore the stuff that you don't care about and just get to the area that you care about. We try to extract that, and we do that according to a spec that is really interesting, a community-driven spec. One of the biggest complaints about dealing with satellite imagery and Earth observation imagery is that it's kind of a mess. There's a lot of different scientific variables and sensor variables and things. So there's been a community effort over the past three or four years to develop specifications that make this type of information machine readable. And so we've kind of bought fully into that and have processes to look at the data, extract the STAC metadata, which is just a JSON schema specification with extensions, and then write that into Postgres.

00:15:08 And one of the things that we have been trying to do for transparency and contribution to open source is a lot of that ETL code base. The Python code that actually works over the files and extracts the metadata is open source in the stac-utils GitHub organization. So we're trying to contribute to that sort of body of work of how to generate STAC metadata for these different data types.

-00:15:34 You want, like the metadata for the exact same image that's coming from the USGS public sector. David is set. 
You want the stack metadata the identical for that, whether you're using our API or Google Earth engine, who also provides a Stack API. And so we're working together on these kind of, like shared core infrastructure libraries.

+00:15:34 You want, like, the metadata for the exact same image that's coming from the USGS public sector data set. You want the STAC metadata to be identical for that, whether you're using our API or Google Earth Engine, who also provides a STAC API. And so we're working together on these kind of, like, shared core infrastructure libraries.

-00:15:56 This portion of Talk Python Omy is brought to you by Shortcut, formerly known as Clubhouse IO. Happy with your project management tool. Most tools are either too simple for a growing engineering team to manage everything or way too complex for anyone to want to use them without constant prodding. Shortcut is different, though, because it's worse. No, wait, no, I mean it's better. Shortcut is project management built specifically for software teams. It's fast, intuitive, flexible, powerful, and many other nice positive adjectives. Key features include genebased workflows. Individual teams can use default workflows or customize them to match the way they work. Org wide goals and roadmaps. The work in these workflows is automatically tied into larger company goals. It takes one click to move from a roadmap to a team's work to individual updates and back height version control integration. Whether you use GitHub, GitLab or Bitbucket Club House ties directly into them so you can update progress from the command line keyboard friendly interface. The rest of Shortcut is just as friendly as their power bar, allowing you to do virtually anything without touching your mouse. Throw that thing in the trash iteration, planning, set weekly priorities, and let Shortcut run the schedule for you with accompanying burndown charts and other reporting. Give it a try over at Talk Python FM Shortcut again, that's Talk Python FM shortcut. Choose Shortcut because you shouldn't have to project manage your project management.

+00:15:56 This portion of Talk Python to Me is brought to you by Shortcut, formerly known as Clubhouse.IO. Happy with your project management tool? Most tools are either too simple for a growing engineering team to manage everything or way too complex for anyone to want to use them without constant prodding. Shortcut is different, though, because it's worse. No, wait, no, I mean it's better. Shortcut is project management built specifically for software teams. It's fast, intuitive, flexible, powerful, and many other nice positive adjectives. Key features include team-based workflows. Individual teams can use default workflows or customize them to match the way they work. Org wide goals and roadmaps. The work in these workflows is automatically tied into larger company goals. It takes one click to move from a roadmap to a team's work to individual updates and back. Tight version control integration. Whether you use GitHub, GitLab or Bitbucket, Shortcut ties directly into them so you can update progress from the command line. Keyboard friendly interface. The rest of Shortcut is just as friendly as their power bar, allowing you to do virtually anything without touching your mouse. Throw that thing in the trash. Iteration planning: set weekly priorities, and let Shortcut run the schedule for you with accompanying burndown charts and other reporting. 
Give it a try over at 'talkpython.fm/shortcut'. Again, that's 'talkpython.fm/shortcut'. Choose Shortcut because you shouldn't have to project manage your project management.

00:17:25 Well, let's dive into some of the data, actually, and talk a little bit about all these data sets. So a lot of data, as we said, over here, maybe highlight some of the important data sets that you all have on offer.

-00:17:39 So central to is our largest and is incredibly important. It's multispectral imagery, optical imagery that is ten meter resolution. So it's the highest resolution. And when we talk about satellites, we often talk about what is the resolution that's captured? Because something like land set, which we also have Landsat eight is 30 meters resolution. So once you get down to street level, you can't really see everything's blurry.

+00:17:39 So Sentinel-2 is our largest and is incredibly important. It's multispectral imagery, optical imagery that is ten meter resolution. So it's the highest resolution. And when we talk about satellites, we often talk about what is the resolution that's captured? Because something like Landsat, which we also have, Landsat 8, is 30 meter resolution. So once you get down to street level, you can't really see, everything's blurry.

-00:18:07 It Pixel represents 30 meters on the ground.

+00:18:07 Each pixel represents 30 meters on the ground.

00:18:11 Right.

00:18:28 Exactly.

-00:18:29 And it's still pretty low, right resolution compared to commercially available imagery. But as far as open data sets, it's high resolution. It's passively collected. I think the revisit rate is I should have this off hand. I think it's eight days. So you can really do monitoring, use cases with that, it generates petabytes and petabytes of data. So it's a lot to sort of work over generating the stack metadata for that.

+00:18:29 And it's still pretty low resolution compared to commercially available imagery. But as far as open data sets, it's high resolution. It's passively collected. I think the revisit rate is, I should have this off hand, I think it's eight days. So you can really do monitoring use cases with that. It generates petabytes and petabytes of data. So it's a lot to sort of work over, generating the STAC metadata for that.

00:18:53 You got to fire up, like, 10,000 cores to kind of run through that. You end up actually reaching the limits of how fast you can read and write from different services.

00:19:04 My gosh.

-00:19:05 Yeah, but it's a really great data set. A lot of work is being done against Cental, too.

+00:19:05 Yeah, but it's a really great data set. A lot of work is being done against Sentinel-2.

00:19:11 So a lot of what I'm seeing as I'm reading through here is this annually, or this from 2000 to 2006, or like, the one you're just speaking about is from 2016. So this data is getting refreshed. And can I ask questions like, how did this polygon of a map look two years ago versus last year versus today? Totally. Yeah.

00:19:32 And you can do that with the sort of API to say, okay, here's my polygon of interest. This is over my house or whatever. Fetch me all the images. But a lot of satellite imagery, I mean, most of it is clouds. It's just the Earth is covered in clouds. You're going to get a lot of clouds. So there's also metadata about the cloudiness. So you can say, okay, well, give me these images over time.

-00:19:54 But I want the scenes to be under cloudy, right?

+00:19:54 But I want the scenes to be under 10% cloudy, right?
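That space, time, and cloudiness lookup is the core of what the STAC API exposes. As a rough sketch of what the query being described looks like from Python, assuming the pystac-client package pointed at the public Planetary Computer STAC endpoint (the bounding box, date range, and cloud threshold here are illustrative):

    # Sketch: find Sentinel-2 scenes over an area of interest, within a date
    # range, under 10% cloud cover. Parameters are illustrative.
    from pystac_client import Client

    catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
    search = catalog.search(
        collections=["sentinel-2-l2a"],
        bbox=[-122.27, 47.54, -121.96, 47.75],  # lon/lat box over the area of interest
        datetime="2019-01-01/2021-01-01",       # the time window to search
        query={"eo:cloud_cover": {"lt": 10}},   # only scenes under 10% cloudy
    )
    items = list(search.get_items())
    print(len(items), "matching scenes")

Because the search runs against the metadata indexed in Postgres, it comes back in seconds without touching any of the underlying imagery.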
00:19:58 I'm willing for it to not be exactly 365 days apart, but maybe 350, because I get a clear view if I do that, something like this.

-00:20:05 Exactly. And then you can make a little time lapse of how that area has changed over time. And in fact, I think there was somebody who actually demoed timelapse similar type of time lapse, just grabbing the satellite imagery and trying to get into a video over an area. Forget a very neat that was.

+00:20:05 Exactly. And then you can make a little time lapse of how that area has changed over time. And in fact, I think there was somebody who actually demoed a similar type of time lapse, just grabbing the satellite imagery and turning it into a video over an area. I forget who exactly that was.

-00:20:22 Yeah, that one sent in all the large one. The revisit time is every five days. That's a lot of data.

+00:20:22 Yeah, that one, Sentinel, the large one. The revisit time is every five days. That's a lot of data.

00:20:29 Yeah.

00:20:32 A lot of cloud.

-00:20:34 Yeah. What about some of these other ones here the day Met, which is graded estimates of weather parameters in North America.

+00:20:34 Yeah. What about some of these other ones here, the Daymet, which is gridded estimates of weather parameters in North America.

00:20:41 That's pretty interesting.

-00:20:43 So, Dame's, actually an example of a lot of our data, geospatial, satellite imagery or things that are derived from that, like elevation data sets where you're using imagery to figure out how what's the elevation of the land or things like land cover data set. So if you scroll down just a tad, the land cover data set there that's based off that area. And so there's a saying for every Pixel in Sentinel, they took like a mosaic over a year.

+00:20:43 So, Daymet's actually an example of a lot of our data, geospatial satellite imagery or things that are derived from that, like elevation data sets, where you're using imagery to figure out what's the elevation of the land, or things like the land cover data set. So if you scroll down just a tad, the land cover data set there, that's based off Sentinel actually. And so they're saying, for every pixel in Sentinel, they took like a mosaic over a year.

-00:21:12 What is that Pixel being used for? Is it water, trees, buildings, roads, things like that. So those are examples based off of satellite imagery or aerial photography. And the day that's an example of something that's, like the output of a climate or weather model.

+00:21:12 What is that pixel being used for? Is it water, trees, buildings, roads, things like that. So those are examples based off of satellite imagery or aerial photography. And Daymet's an example of something that's, like, the output of a climate or weather model.

-00:21:30 So these are typically higher dimensional. You're going to have things like temperature or maximum minimum temperature, water pressure, vapor, all sorts of things that are stored in this really big in dimensional Cube at various coordinates. So latitude, longitude, time, maybe height above.

+00:21:30 So these are typically higher dimensional. You're going to have things like temperature or maximum and minimum temperature, water pressure, vapor, all sorts of things that are stored in this really big n-dimensional cube at various coordinates. So latitude, longitude, time, maybe height above surface. 
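A cube like the one being described maps naturally onto an xarray Dataset. A minimal sketch of lazily opening and slicing one; the store URL and the variable name are hypothetical:

    # Sketch: lazily open a chunked n-dimensional climate cube and slice it.
    # The store URL and the 'tmax' variable name are hypothetical.
    import xarray as xr

    ds = xr.open_dataset(
        "https://example.blob.core.windows.net/daymet/na.zarr",  # hypothetical store
        engine="zarr",
        chunks={},  # Dask-backed arrays, chunked the way the store is chunked
    )
    tmax_july = ds["tmax"].sel(time="2020-07")  # one month of daily max temperature

Nothing is downloaded until a computation actually needs the values, which is what makes stores this large workable from a notebook.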
00:21:49 So those are stored typically in formats like Zarr, which is this cloud native, very friendly to object storage way of storing chunked n-dimensional arrays.

@@ -214,11 +223,11 @@

00:22:32 Okay.

-00:22:32 So this is off of basically just studying light. Interesting.

+00:22:32 So this is off of basically just studying light. Interesting.

-00:22:35 And we have a few more that are coming online shortly, which are kind of more tabular. So, like US Census gives you the Polygon. So the state of Iowa has these counties for census blocks, which are this shape. So giving you all those shapes that it has this population, things like that.

+00:22:35 And we have a few more that are coming online shortly, which are kind of more tabular. So, like, US Census gives you the polygons. So the state of Iowa has these counties or census blocks, which are this shape. So giving you all those shapes, that it has this population, things like that.

-00:22:53 Things like GIF has, which is I think on there now has occurrences of like, I think they're like observations of somebody spotted this animal or plant at this latitude, longitude at this time, things like that. So lots of different types of data.

+00:22:53 Things like GBIF, which is I think on there now, has occurrences of, like, I think they're observations of somebody spotted this animal or plant at this latitude, longitude at this time, things like that. So lots of different types of data.

00:23:07 Interesting. A mink was spotted running through the streets.

00:23:10 You have one for agriculture.

-00:23:12 That's pretty interesting. If you are doing something with agriculture and farming and then trying to do Ml against that.

+00:23:12 That's pretty interesting. If you are doing something with agriculture and farming and then trying to do ML against that.

-00:23:18 That's interesting, because that's actually run by the national agarage culture that's actually aerial imagery, RGB, red, green, blue, and then also infrared aerial imagery that's collected about every three years.

+00:23:18 That's interesting, because that's actually run by the National Agriculture Imagery Program. That's actually aerial imagery, RGB, red, green, blue, and then also infrared aerial imagery that's collected about every three years.

00:23:31 So that's an example of high resolution imagery that I think is 1 meter resolution.

00:23:38 You can see the little trees and stuff; it's very accurate.

-00:23:42 Great data set specific to the US. So again, like Central Two is global in scope. But if you are doing things in the United States, Snape is a great data sets. It's used.

+00:23:42 Great data set specific to the US. So again, like, Sentinel-2 is global in scope. But if you are doing things in the United States, NAIP is a great data set to use.

00:23:52 Yeah. You've got the USGS 3D elevation for topology. That's cool.

00:24:25 So for these additional ones, maybe I could directly access them out of Blob storage, but I can't ask API questions.

-00:24:30 Exactly. And then another point, which is kind of interesting talking back to the tabular data is that some of these data formats aren't quite. I mean, raters and imagery is fits really nicely in stack, and we know how to do space. Your temple queries over them. But some of these data formats, they're not as mature as maybe the rest of your data format, or it's not as clear how to host them in a cloud optimized format and then host them in a space or temporal API. 
So we're actually having to do work to say. Okay. What are the standards? Is it like Go Park or what are the formats that we're going to be using and hosting these data sets? And then how do we actually index the metadata through the API? So there's a lot of sort of data format and specification metadata specification work before we can actually host all of these in the API.

+00:24:30 Exactly. And then another point, which is kind of interesting, talking back to the tabular data, is that some of these data formats aren't quite there. I mean, rasters and imagery fit really nicely in STAC, and we know how to do spatiotemporal queries over them. But some of these data formats, they're not as mature as maybe the raster data formats, or it's not as clear how to host them in a cloud optimized format and then host them in a spatiotemporal API. So we're actually having to do work to say, okay, what are the standards? Is it like GeoParquet, or what are the formats that we're going to be using in hosting these data sets? And then how do we actually index the metadata through the API? So there's a lot of sort of data format and metadata specification work before we can actually host all of these in the API.

00:25:15 Really nice. A lot of good data here and quite large. Let's talk about the ETL for just a minute because you threw out some crazy numbers there. We're looking at the Sentinel-2 data, and it gets refreshed every five days. And it's the Earth. Talk us through what has to happen there.

00:25:42 And so that comes off to ground stations through the European Space Agency.

-00:25:46 And then we have some partners who are taking that, converting it to the cloud Optimas GeoIP format, putting it on Blob storage, at which point we run our ingest pipeline, look for new imagery, extract the stack metadata, insert that into the database. And we just have that running in an Azure service called Azure batch which allows us to run parallel tasks on clusters that can auto scale. So if we're doing an ingest of a data set for the first time, there's going to be a lot of files to process, and we can scale scale that up, and it runs Docker container. So we just have a project that defines the Docker commands that can run. And then we can submit tasks for chunks of the files that we are processing that creates the stack items. And then another separate process actually takes the stack items and insert it into the database.

+00:25:46 And then we have some partners who are taking that, converting it to the cloud optimized GeoTIFF format, putting it on Blob storage, at which point we run our ingest pipeline, look for new imagery, extract the STAC metadata, insert that into the database. And we just have that running in an Azure service called Azure Batch, which allows us to run parallel tasks on clusters that can auto scale. So if we're doing an ingest of a data set for the first time, there's going to be a lot of files to process, and we can scale that up, and it runs Docker containers. So we just have a project that defines the Docker commands that can run. And then we can submit tasks for chunks of the files that we are processing that create the STAC items. And then another separate process actually takes the STAC items and inserts them into the database.

00:26:37 That's cool. So it's a little bit data driven rather than, like, Azure Functions or AWS Lambda.
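To make the "create the STAC items" step concrete, here is a rough sketch of what producing a single item can look like with the pystac library. This is illustrative only, not the team's actual pipeline code; the scene ID, geometry, and asset URL are made up:

    # Sketch: build one STAC item for an ingested scene, using pystac.
    # All identifiers and values below are hypothetical.
    from datetime import datetime, timezone
    import pystac

    item = pystac.Item(
        id="S2A_MSIL2A_20210901T190911",  # hypothetical scene ID
        geometry={"type": "Polygon", "coordinates": [[
            [-122.6, 47.2], [-121.8, 47.2], [-121.8, 47.9],
            [-122.6, 47.9], [-122.6, 47.2]]]},
        bbox=[-122.6, 47.2, -121.8, 47.9],
        datetime=datetime(2021, 9, 1, 19, 9, 11, tzinfo=timezone.utc),
        properties={"eo:cloud_cover": 7.3},  # queryable metadata
    )
    item.add_asset("B04", pystac.Asset(
        href="https://example.blob.core.windows.net/s2/B04.tif",  # hypothetical COG
        media_type=pystac.MediaType.COG,
    ))

A separate process would then insert items like this into the Postgres database that backs the API, as described above.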
00:26:45 But processing, we just kind of get all the data and just work through it kind of at scale. Interesting.

-00:26:51 Yeah, for sure. Right now, it's a little bit. We're still building a plane as we're flying it. But the next iteration is actually going to be a lot more reactive. And based on another Azure service called defent Grid, where you can get notifications of new Blobs going into storage and then put messages into cues that can then turn into these Azure batch tasks that are run. Right.

+00:26:51 Yeah, for sure. Right now, it's a little bit, we're still building the plane as we're flying it. But the next iteration is actually going to be a lot more reactive, and based on another Azure service called Event Grid, where you can get notifications of new Blobs going into storage and then put messages into queues that can then turn into these Azure Batch tasks that are run. Right.

00:27:13 I see. So you just get something that drops it in the Blob storage, and it kicks off everything from there, and you don't have to worry about it.

00:27:27 Oh, that's cool. There's a way to get notified of refreshes and things like that.

-00:27:32 Not yet. We're hoping to get that ended, but the idea is that we would have basically a live feed of new imagery.

+00:27:32 Not yet. We're hoping to get that by the end of the year, but the idea is that we would have basically a live feed of new imagery.

00:27:39 What I would really like to see, just for myself, my own interest, is to be able to have my areas of interest and then just go to a page that shows, like, almost an Instagram feed of Sentinel images over that area. It's like, this new one is not cloudy. Look at that one. It's something I'm monitoring. But yeah, generally we'll be publishing new STAC items, so that if you're running AI models off of the imagery as it comes in, you can do that processing based off of events. Yeah.

@@ -296,11 +305,11 @@

00:29:04 It's a cool, like, JavaScript application that you can view risk on.

-00:29:09 These companies are buying like carbon Al sets that are for trees that are planted to offset carbon. But there's a problem that we know about now is the wildfires are burning down some of those forests.

+00:29:09 These companies are buying, like, carbon assets that are for trees that are planted to offset carbon. But there's a problem that we know about now, which is the wildfires are burning down some of those forests.

00:29:25 And it doesn't help if you planted a bunch of trees to offset your carbon if they go up in smoke. Right.

-00:29:30 Right. Yeah. So Carbon Plan did a bunch of research, first of all, on essentially, they did the research before our up existed. But we're working with these community members, and they have a very similar set up to what we have now to do the research to train the models and all that goes into this visualization here of how likely what are the different risks for each plot of land in the US. So that was a great collaboration there.

+00:29:30 Right. Yeah. So Carbon Plan did a bunch of research, first of all, on, essentially, they did the research before our hub existed. But we're working with these community members, and they have a very similar setup to what we have now, to do the research, to train the models, and all that goes into this visualization here of what are the different risks for each plot of land in the US. So that was a great collaboration there.
00:29:59 One of the things I was wondering when I was looking at these is, you all are hosting these large amounts of data and you're offering compute to study them. How does something like Carbon Plan take that data and build this seemingly independent website? Does that run directly on that data?

@@ -330,11 +339,11 @@

00:32:46 The grants are great for if you have a complex deployment that's using a ton of Azure services and you want to integrate this all together and use the planetary computer data, then the grants are a great approach. If you're just, like, an individual researcher, a team of researchers, or whoever who wants to use this data, the data is there. It's publicly accessible. And if you need a place to compute from that's in Azure, so close to the data, and you don't already have an Azure subscription, then you can sign up for a planetary computer account. And so that's a way lower barrier to entry. There, you just sign up for an account, you get approved by us, and then you're off to the races.

-00:33:26 Er point.

+00:33:26 Fair point.

-00:33:27 If you think you need a grant to use the cloud, try using the plants to our computer first, because you might not.

+00:33:27 If you think you need a grant to use the cloud, try using the planetary computer first, because you might not.

-00:33:33 Very good talk Python Amy is partially supported by our training courses. When you need to learn something new, whether it's foundational Python advanced topics like a sync or Web apps and Web APIs, be sure to check out our over 200 hours of courses at the How Python. And if your company is considering how they'll get up to speed on Python, please recommend they give our content a luck. Thanks.

+00:33:33 Very good. Talk Python to Me is partially supported by our training courses. When you need to learn something new, whether it's foundational Python, advanced topics like async, or web apps and web APIs, be sure to check out our over 200 hours of courses at Talk Python Training. And if your company is considering how they'll get up to speed on Python, please recommend they give our content a look. Thanks.

00:33:58 So what's the business model around this? Is there going to be a fee for it? Is there some free level? Is it always free but restricted in how you can use it? Because right now it's in a private beta, right? I can come and request access to it.

@@ -366,27 +375,27 @@

00:36:35 Yeah.

-00:36:35 It's actually going to fire up a machine, and it gives you four choices, right? Python with four cores and 32 gigs of memory and a Pang notebook gives you R with eight cores and Geospatial and GPU Pi torch as well as is which I don't really know what that is. We started it.

+00:36:35 It's actually going to fire up a machine, and it gives you four choices, right? Python with four cores and 32 gigs of memory in a Pangeo notebook; it gives you R with eight cores and R geospatial; and GPU PyTorch; as well as QGIS, which I don't really know what that is. We started it.

-00:36:55 So this is a Jupiter Hub deployment. So Jupiter Hubs, this really nice project. I think it came out of UC Berkeley when they were kind of teaching classes, data science courses to, like, thousands of students at once, even with, like, condo or whatever. You don't want to be trying to manage a thousand students, condo installations or whatever.

+00:36:55 So this is a JupyterHub deployment. So JupyterHub is this really nice project. 
I think it came out of UC Berkeley when they were kind of teaching classes, data science courses to, like, thousands of students at once. Even with, like, conda or whatever, you don't want to be trying to manage a thousand students' conda installations or whatever.

00:37:16 So that's just a nightmare. So they had this kind of cloud based setup where you just log in with your credentials or whatever. You get access to a compute environment to do your homework in that case, or do your geospatial data analysis in this case.

-00:37:31 And so this kind of you mentioned in geo. This is the ecosystem of geo businesses, geo scientists who are trying to do scalable Geoscience on the cloud that Anaconda was involved with. And so they kind of pioneered this concept of a Jupiter Hub deployment on Kubernetes that's tied to Ask so you can create easily get a single node compute environment here, in this case is in the Python environment or multiple nodes, a cluster of machines to do your analysis using Task and Dask gateway. Yeah. Let's just say a Kubernetes based computing environments. That's cool.

+00:37:31 And so this kind of, you mentioned Pangeo. This is the ecosystem of geoscientists who are trying to do scalable geoscience on the cloud that Anaconda was involved with. And so they kind of pioneered this concept of a JupyterHub deployment on Kubernetes that's tied to Dask, so you can easily get a single node compute environment here, in this case the Python environment, or multiple nodes, a cluster of machines, to do your analysis using Dask and Dask Gateway. Yeah. Let's just say a Kubernetes based computing environment. That's cool.

-00:38:07 And I noticed right away the desk integration, which is good for this massive amounts of data. Right. Because it allows you to scale across machines or stream data where you don't have enough to store a memory and things like that.

+00:38:07 And I noticed right away the Dask integration, which is good for these massive amounts of data. Right. Because it allows you to scale across machines or stream data where you don't have enough to store in memory and things like that.

00:38:19 Yeah. Exactly. So this is a great thing that we get for Python, since Dask is Python specific. We do have the other environments like R.

-00:38:28 If you're doing Geospatial and R, which there's a lot of really great libraries there, that's an option that is unfortunately, single note. There's not really a Das equivalent there, but there's some cool stuff that's being worked on, like multiplier and things like that.

+00:38:28 If you're doing geospatial in R, which there's a lot of really great libraries there, that's an option that is, unfortunately, single node. There's not really a Dask equivalent there, but there's some cool stuff that's being worked on, like multidplyr and things like that.

-00:38:43 Cool. People haven't seen Dask running and Jupiter notebook. There's the whole cluster visualization and the sort of progress computation stuff is super neat to see it go.

+00:38:43 Cool. If people haven't seen Dask running in a Jupyter notebook, there's the whole cluster visualization, and the sort of progress computation stuff is super neat to see it go.

00:38:53 Yeah. So when you're doing these distributed computations, it's really key to have an understanding of what your cluster is up to. It's crucial to be able to have that information there.

-00:39:05 And the example code that you've got there the cloudless mosaic seal to notebook. 
It just has basic.

+00:39:05 And the example code that you've got there, the cloudless mosaic Sentinel-2 notebook, it just has basic.

-00:39:14 Create a cluster and ask get the client, create four to 24 workers, and then office goes, right.

+00:39:14 Create a cluster in Dask, get the client, scale from four to 24 workers, and then off it goes, right.

00:39:23 Yeah. Exactly.

00:39:55 That's real computing, right.

-00:39:56 There definitely a and in this case, we're using Desk adaptive mode. So we're saying right now there's nothing to do. It's just sitting around Idly. So I have three or four workers.

+00:39:56 There definitely is. And in this case, we're using Dask adaptive mode. So we're saying right now there's nothing to do. It's just sitting around idly. So I have three or four workers.

-00:40:08 But once I start to actually do a computation that's using desk, it'll automatically scale up in the background, which is a neat feature of desk. And so the basic computation, the problem that we're trying to do here is we have some area of interest, which I think is over. Redmond Washington, Microsoft headquarters, which we're defining as this out exact square area.

+00:40:08 But once I start to actually do a computation that's using Dask, it'll automatically scale up in the background, which is a neat feature of Dask. And so the basic computation, the problem that we're trying to do here, is we have some area of interest, which I think is over Redmond, Washington, Microsoft headquarters, which we're defining as this exact square area.

00:40:28 Yeah.

00:40:33 Anyway, we draw that out and then we say, okay, give me all of the Sentinel-2 items that cover that area. So again, back to what we were talking about at the start: if you just had files in blob storage, it'd be extremely difficult to do.

-00:40:48 But thanks to this nice stack API, which we can connect to here at Planetary Computer, Microsoft. Com, we're able to quickly say, hey, give me all the images from 2016 to 2020 from Sentinel that cover that intersect with our area of interest here. And we're even throwing in a query here saying, hey, I only want scenes where the cloud cover is less than 25%, according to the metadata.

+00:40:48 But thanks to this nice STAC API, which we can connect to here at planetarycomputer.microsoft.com, we're able to quickly say, hey, give me all the images from 2016 to 2020 from Sentinel that intersect with our area of interest here. And we're even throwing in a query here saying, hey, I only want scenes where the cloud cover is less than 25%, according to the metadata.

00:41:13 Very likely summer in Seattle.

-00:41:15 Because the way not so much fewer much here quickly. Within a second or two, we get back the 138 scenes items out of the I don't know how many there are in total, but like, hundreds of thousands, millions of individual stack items that million 20 million. Okay, that can prize settle to. So we're quickly able to filter that down.

+00:41:15 Yeah, not so much here. Quickly, within a second or two, we get back the 138 scene items out of, I don't know how many there are in total, but, like, hundreds of thousands, millions of individual STAC items, maybe 20 million, that comprise Sentinel-2. So we're quickly able to filter that down.

-00:41:39 Next up, we have a bit of signing. So this is that it that we talked about where you can do all this in ano smoothie. 
But in order to actually access the data, we have you sign the items, which basically opens this little token to the URLs, and then at that point, they can be opened up by any geospatial program like Jugs or a private private black storage URL to a temporary public one.

+00:41:39 Next up, we have a bit of signing. So this is that bit that we talked about, where you can do all this anonymously. But in order to actually access the data, we have you sign the items, which basically appends this little token to the URLs, and then at that point, they can be opened up by any geospatial program like QGIS. It turns a private Blob storage URL into a temporary public one.

00:42:03 Yeah.

00:42:04 So you do that.

-00:42:05 It's just like this kind of incidental happenstance that stack and ask actually pair extremely nicely if you think about as the way it operates is it's all about lately operating lately, constructing a task graph of computations, and then at the end of your whatever you're doing computing, that all at once. That just gives really nice rooms for optimizations and maximizing parallelization wherever possible. The thing about geospatial is again, if you didn't have stack, you'd have to open up these files to understand where on Earth is it? What latitude, longitude does it cover?

+00:42:05 It's just like this kind of incidental happenstance that STAC and Dask actually pair extremely nicely. If you think about Dask, the way it operates is it's all about operating lazily, constructing a task graph of computations, and then at the end of whatever you're doing, computing that all at once. That just gives really nice room for optimizations and maximizing parallelization wherever possible. The thing about geospatial is, again, if you didn't have STAC, you'd have to open up these files to understand where on Earth is it? What latitude, longitude does it cover?

00:42:43 What, I'd have to open all 20 million files and then look and see what its metadata is, right?

00:42:48 Okay. And in this case, we have, like, 138 times three files. So whatever, 400 or 500 item files here. Opening each one of those takes maybe 200, 400, 500 milliseconds. So it's not awful, but it's, like, too slow to really do interactively at any scale with any large number of STAC items.

-00:43:12 That's where stack scree. It has all the metadata. So we know that this tip file, this plot optimized go tip file that contains the actual data. We know exactly where it is on Earth, what latitude, longitude it covers, what time period it covers, what asset it actually represents, wavelength. So we're able to very quickly stack these together into this X ray data.

+00:43:12 That's where STAC comes in. It has all the metadata. So we know, for this TIFF file, this cloud optimized GeoTIFF file that contains the actual data, we know exactly where it is on Earth, what latitude, longitude it covers, what time period it covers, what asset it actually represents, wavelength. So we're able to very quickly stack these together into this xarray data array.

-00:43:33 In this case, it's fairly small since we've chopped it down. If we leave out the filtering, it'd be much, much larger because these are really large scenes. But anyway, we're able to really quickly generate these data arrays. And then using task using our Das cluster, we can actually load those persist, those in distributed memory on all the workers on our cluster. So that's, like, very cool, very easy. 
-00:43:33 In this case, it's fairly small since we've chopped it down. If we leave out the filtering, it'd be much, much larger because these are really large scenes. But anyway, we're able to really quickly generate these data arrays. And then using task using our Das cluster, we can actually load those persist, those in distributed memory on all the workers on our cluster. So that's, like, very cool, very easy. It's like a few lines of code, a single function call. But it represents years of effort to build up the stack specification and all the metadata and then the integration in the desk.
+00:43:33 In this case, it's fairly small since we've chopped it down. If we leave out the filtering, it'd be much, much larger, because these are really large scenes. But anyway, we're able to really quickly generate these data arrays. And then, using our DASK cluster, we can actually load those, persist those in distributed memory on all the workers on our cluster. So that's, like, very cool, very easy. It's like a few lines of code, a single function call. But it represents years of effort to build up the STAC specification and all the metadata, and then the integration with DASK.

-00:44:07 So it's just a fantastic result that we have even cool what you just call data persist on the desk array.
+00:44:07 So it's just a fantastic result that we have here. Even cooler, once you just call data.persist on the DASK array.

-00:44:17 You could just see in the dashboard of Das, like, all these clusters firing up and all this data getting processed?
+00:44:17 You could just see in the DASK dashboard, like, all these workers firing up and all this data getting processed?

-00:44:25 Yes, exactly. So in this case, since we have that adaptive mode, we'll see additional workers come online here as we start to stress the cluster and saying, oh, I've got a bunch of unfinished tasks. I should bring online some more workers and Tattle team, either a few seconds if there's empty space on our cluster or a bit longer.
+00:44:25 Yes, exactly. So in this case, since we have that adaptive mode, we'll see additional workers come online here as we start to stress the cluster, and it's saying, oh, I've got a bunch of unfinished tasks, I should bring online some more workers. And that'll take either a few seconds, if there's empty space on our cluster, or a bit longer.

-00:44:43 Yeah, I feel like with this. If it just sat there and said, it's going to take two minutes and just one with a little star. The Jupiter star that would be both made a little dash. More like, I want to just watch it go.
+00:44:43 Yeah, I feel like with this, if it just sat there and said it's going to take two minutes and just spun with a little star, the Jupyter star, that would be boring. I like the little DASK dashboard more. Like, I want to just watch it go.

00:44:54 Look at that guy.

@@ -468,7 +477,7 @@

00:47:05 Yeah. Yeah.

-00:47:05 Very cool. All right, Tim, your graph stopped moving around. It might be done.
+00:47:05 Very cool. All right, Tom, your graph stopped moving around. It might be done.

00:47:10 Yeah. So we spent quite a while loading up the data. And then that's just how it goes. You spend a bunch of time loading up data, and then once it's in memory, computations tend to be pretty quick. So in this case, we're taking a median over time.

@@ -488,29 +497,29 @@

00:48:31 Looks like maybe that's Lake Washington. And you got Rainier there and all sorts of good stuff.

-00:48:38 Yeah, I'm sure I actually do not know the GRP that well, but I have been looking at lots of pictures. We tend to use this as our example area. A lot super cool.
+00:48:38 Yeah, I'm sure. I actually do not know the geography that well, but I have been looking at lots of pictures. We tend to use this as our example area a lot. Super cool.
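Sketching that step with the same illustrative array: persist pins the chunks in distributed memory across the workers, and under adaptive mode it is what triggers the scale-up; the cloud-free composite is then a single reduction over time.

    # Load the lazy array into distributed memory across the cluster's workers.
    data = data.persist()

    # Cloud-free composite: the per-pixel median over the time dimension.
    median = data.median(dim="time").compute()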
00:48:49 And one nice thing here is, again, we're investing heavily in open source, investing in building off of open source. So we have all the power of xarray to use. Xarray is this very general purpose n-dimensional array computing library that kind of combines the best of NumPy and Pandas. In this case, we can do something like a group by. So if you're familiar with Pandas, you're familiar with group bys. We can group by time.month.

-00:49:14 I want to do, like a monthly mosaic. Maybe I don't want to combine images from January, which might have known them with images from July, which won't have as much. So I can do it.
+00:49:14 I want to do, like, a monthly mosaic. Maybe I don't want to combine images from January, which might have snow on them, with images from July, which won't have as much. So I can do that.

00:49:24 We have, like, twelve different images or something like that. Here's what it kind of averaged out to be in February.

-00:49:30 Exactly. And so now we have a stack of images, twelve of them, and we can go ahead and representing a median. So we have multiple years and we group all of the ones from January together and take the median of those. And then we get nice little group of cloud free mistakes here one for each month. Yeah.
+00:49:30 Exactly. And so now we have a stack of images, twelve of them, each representing a median. So we have multiple years, and we group all of the ones from January together and take the median of those. And then we get a nice little group of cloud-free mosaics here, one for each month. Yeah.

-00:49:48 Sure enough, there is a little less snow around Reneer summer than in the winter, as you would do the Cascades.
+00:49:48 Sure enough, there is a little less snow around Rainier in the summer than in the winter, as you would expect in the Cascades.

-00:49:53 Yep. Definitely. So that's like a fun little introductory example to what the hug gives. You use single node environment, which that alone is quite a bit. You don't have to mess with fighting to get the right set of libraries installed, which can be especially challenging when you're interfacing with the C and C Plus plus libraries like all. So that environment is all set up mostly compatible. Should all work for you on a single node. And then if you do have these larger computations, we saw it took a decent while to load the data, even with these fast Interop between the storage machines and the compute machines and the same as our region. But you can scale that out on enough machines that your computations complete in a reasonable amount of time because of the animations.
+00:49:53 Yep. Definitely. So that's like a fun little introductory example of what the hub gives you: the single node environment, which alone is quite a bit. You don't have to mess with fighting to get the right set of libraries installed, which can be especially challenging when you're interfacing with C and C++ libraries like GDAL. So that environment is all set up, mostly compatible. Should all work for you on a single node. And then if you do have these larger computations, we saw it took a decent while to load the data, even with this fast interop between the storage machines and the compute machines in the same Azure region. But you can scale that out on enough machines that your computations complete in a reasonable amount of time. And because of the animations.

00:50:35 You don't even mind it. It's super cool. So you use the API to really narrow it down from 20 million to 150 or 138 images and then work through it. So one thing that I was wondering when I was looking at this is what libraries come included that I can import, and which ones.
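Continuing the sketch, the monthly mosaics described here are one xarray group-by away (month=2 would pull out the February composite):

    # Group scenes by calendar month, then take a per-pixel median within each
    # group: twelve cloud-free monthly mosaics built from multiple years.
    monthly = data.groupby("time.month").median(dim="time").compute()

    february = monthly.sel(month=2)  # e.g. the February composite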
-00:50:55 If there's something that's not there, maybe I really want to use Http and you only have requests or whatever. Is there a way to get additional libraries and packages and stuff in there.
+00:50:55 If there's something that's not there, maybe I really want to use HTTPX and you only have Requests, or whatever. Is there a way to get additional libraries and packages and stuff in there?

-00:51:03 We do have a focus on geospatial, so that's like we'll have most of that there already. So Xray desk rest area and all those things. But if there is something there our container.
+00:51:03 We do have a focus on geospatial, so we'll have most of that there already. So xarray, DASK, rasterio and all those things. But if there is something missing, there are our containers.

-00:51:16 So these are all Docker images built from Conda environments. That all comes from this repository, Microsoft Planetary Computer containers. So if you want TP, you add it to the environment and we'll get a new image built and then available from the planetary computer. And so these are public images. They're just on the Microsoft container registry. So if you want to use our image like you don't want to fight with getting a compatible version of, say, Pi Torch and Lib JPEG. Not that I was doing that recently, but if you want to avoid that pain, then you can just use our images locally, like from your laptop, and you can even connect to our desk gateway using our images from your local laptop and do some really fun setups there.
+00:51:16 So these are all Docker images built from Conda environments. That all comes from this repository, Microsoft/planetary-computer-containers. So if you want HTTPX, you add it to the environment, and we'll get a new image built and then available from the Planetary Computer. And so these are public images. They're just on the Microsoft container registry. So if you want to use our image, like, you don't want to fight with getting a compatible version of, say, PyTorch and libjpeg. Not that I was doing that recently, but if you want to avoid that pain, then you can just use our images locally, like from your laptop, and you can even connect to our DASK gateway using our images from your local laptop and do some really fun setups there.

-00:52:02 Yeah, I see, because most of the work would be happening in the clusters. The data clusters, not locally anyway.
+00:52:02 Yeah, I see, because most of the work would be happening in the clusters, the DASK clusters, not locally anyway.

00:52:08 Yeah. So all the compute happens there, and then you bring back this little image. That's your plot, your result.
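A loose sketch of that laptop-to-cloud setup with the dask-gateway client; the gateway address below is a placeholder, and the real URL, authentication, and image settings come from the Planetary Computer Hub's documentation.

    from dask_gateway import Gateway

    # Placeholder address: substitute the Dask Gateway URL from the hub docs.
    gateway = Gateway("https://<planetary-computer-hub>/services/dask-gateway")
    cluster = gateway.new_cluster()
    cluster.adapt(minimum=4, maximum=24)  # the adaptive 4-to-24 workers above
    client = cluster.get_client()         # computations now run on the cluster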
@@ -522,13 +531,13 @@

00:52:34 So you see, there's quite a few packages in here already.

-00:52:37 And those are just the ones we explicitly asked for, and then all their dependencies get pulled into a lock file, and they built into a Docker images. So this is building off of projects from NGO, that group of geo scientists that I mentioned earlier who have been struggling with this problem for several years now. So they have a really nice Docker eyes set up, and we're just building off that base image.
+00:52:37 And those are just the ones we explicitly asked for, and then all their dependencies get pulled into a lock file, and they get built into Docker images. So this is building off of projects from Pangeo, that group of geoscientists that I mentioned earlier who have been struggling with this problem for several years now. So they have a really nice Dockerized setup, and we're just building off that base image.

00:52:59 Cool. Based on the Pangeo container. Very cool. Simapari asks, how long is the temporary URL active for, the signed URL, the Blob Storage one?

00:53:08 So that actually depends on whether or not you're authenticated. We have some controls there. The Planetary Computer Hub requires access, but also you get an API token, which gives you a little bit longer lasting tokens.

-00:53:21 But forget what the actual current expiries are. If you use the Planter Computer Python library, you just Pip install plans are for computer and use that sign method. It will actually request a token. And then as the token is going to expire, request a new token. So it reasons token and caches it. But it should be long enough for actually pulling down the data files that we have available. Right. Because we're working smaller cloud optimized formats. There aren't these 100 gig files that you should have to pull down and need a single SAS token to last for a really long time, so you can re request if you need a new one as it expires. And like I said, that library actually takes care of the logic for you there.
+00:53:21 But I forget what the actual current expiries are. If you use the Planetary Computer Python library, you just pip install planetary_computer and use that .sign method. It will actually request a token, and then as the token is about to expire, request a new token. So it reuses the token and caches it. But it should be long enough for actually pulling down the data files that we have available. Right. Because we're working with smaller cloud optimized formats, there aren't these 100 gig files that you'd have to pull down and need a single SAS token to last for a really long time. So you can re-request if you need a new one as it expires. And like I said, that library actually takes care of the logic for you there.
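For the longer-lasting tokens mentioned above, the planetary_computer package also exposes a settings hook for an API subscription key; the key string below is a placeholder, and sign itself handles the token caching and refresh just described.

    import planetary_computer

    # Optional: a subscription key ("your-key-here" is a placeholder) buys
    # longer-lived SAS tokens for authenticated users.
    planetary_computer.settings.set_subscription_key("your-key-here")

    # sign() requests a token, caches it, and re-requests it near expiry.
    signed_item = planetary_computer.sign(item)  # item from an earlier search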
00:54:08 That's cool. Yeah, very nice. All right, guys. Really good work with this. And it seems like it's early days, it seems like it's just getting started. There's probably gonna be a lot more going on with this.

00:54:18 I'm gonna go out on a limb and make a big prediction that understanding the climate and climate change is going to be more important, not less important, in the future. So I suspect that's also going to grow some interest.

-00:54:31 In the new report. Fcc is making some heavy predictions, and within the decade, we might reach plus one five Celsius and we're already in it. We're already feeling the effects. And this is the data about our Earth and it's going to become more and more important as we mitigate and adapt to these effects. So yeah, I agree. I think that's a good question.
+00:54:31 In the new report, the IPCC is making some heavy predictions, and within the decade, we might reach plus 1.5 Celsius. And we're already in it. We're already feeling the effects. And this is the data about our Earth, and it's going to become more and more important as we mitigate and adapt to these effects. So yeah, I agree. I think that's a good prediction.

00:54:56 Thanks. If we are going to plan our way out of it and plan for the future and science our way out of it, we're going to need stuff like this. So well done.

00:55:06 All right.

-00:55:06 I think we're about out of time. So let me ask you both the final two questions here if you're going to write some Python code. What editor do you use? Rob Vs code? I suspect I could guess that, but yeah, yeah.
+00:55:06 I think we're about out of time. So let me ask you both the final two questions here. If you're going to write some Python code, what editor do you use? Rob? VS Code? I suspected I could guess that, but yeah, yeah.

-00:55:17 Actually, I was a big Mac user, and then when I got this switched over to Vs code, I just integrated better with Windows and then really got into the plane and the typing system doing type annotations and basically having a compiler for the Python code really change instead of having all of the types in my head and having to worry about all that, actually having the type hinting was something I wasn't doing a year ago, and now it's drastically improved my development experience.
+00:55:17 Actually, I was a big Emacs user, and then when I got this job, I switched over to VS Code. It just integrated better with Windows. And then I really got into Pylance and the typing system, doing type annotations and basically having a compiler for the Python code. It really changed things: instead of having all of the types in my head and having to worry about all that, actually having the type hinting, which was something I wasn't doing a year ago, has drastically improved my development experience.

-00:55:46 It's a huge difference. And I'm all about that as well. People talk about the type speeds super important for things like my pie and other stuff and a lot of cases it can be. But to me, the primary use case is when I hit dot after a thing. I wanted to tell me what I can do. If I have to go to the documentation.
+00:55:46 It's a huge difference. And I'm all about that as well. People talk about how the type hints are super important for things like MyPy and other stuff, and in a lot of cases they can be. But to me, the primary use case is when I hit dot after a thing, I want it to tell me what I can do. If I have to go to the documentation.

00:56:05 Then it's kind of like something that's failing.

@@ -560,21 +569,21 @@

00:56:35 Yeah, totally.

-00:56:36 All right, Tom, how about you vs code as well for most stuff and then Emacs for Mega Match. It the Get client and then a bit of them every now and then.
+00:56:36 All right, Tom, how about you? VS Code as well for most stuff, and then Emacs for Magit, the Git client, and then a bit of Vim every now and then.

-00:56:45 Right on. Very cool. Alright, then the other question is for either of you there's like a cool notable IPI or condo package that I came across. This. It was amazing. People should know about it. Any idea how you going? Sure.
+00:56:45 Right on. Very cool. All right, then the other question is, for either of you, is there a cool, notable PyPI or Conda package, like, I came across this, it was amazing, people should know about it? Any ideas? You go. Sure.

-00:56:59 I'll go Seabourn. It's plotting library from Michael Wakem. Built on top of Matplotlib. It's just really great for exploratory data analysis easily create these great visualizations for mostly tabular data sets, but not exclusively.
+00:56:59 I'll go Seaborn. It's a plotting library from Michael Waskom, built on top of Matplotlib. It's just really great for exploratory data analysis, easily creating these great visualizations for mostly tabular data sets, but not exclusively.

-00:57:15 That's interesting. I know a Seabourn new Matplotlib. I didn't realize that Seabourn was like, let's make Matplotlib easier.
+00:57:15 That's interesting. I knew of Seaborn and Matplotlib. I didn't realize that Seaborn was like, let's make Matplotlib easier.

00:57:21 Yeah, essentially for this very specific use case.

-00:57:24 Matplotlib is extremely flexible, but there's a lot of boilerplate and Seabourn just wraps that all up nicely.
+00:57:24 Matplotlib is extremely flexible, but there's a lot of boilerplate, and Seaborn just wraps that all up nicely.
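To make the boilerplate point concrete, a small example using one of Seaborn's bundled demo datasets; the dataset and column names are just the library's stock samples.

    import seaborn as sns
    import matplotlib.pyplot as plt

    # One call produces a styled scatter plot with a fitted regression line,
    # which would take a fair amount of raw Matplotlib code by hand.
    tips = sns.load_dataset("tips")
    sns.lmplot(data=tips, x="total_bill", y="tip")
    plt.show()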
00:57:30 Yeah, super cool. All right. Well, thank you so much for being here. Final call to action. People wanna get started with Microsoft Planetary Computer. Maybe they've got some climate research.

-00:57:38 What do they do into computer? Microsoft. Com that'll get you anywhere you need to go and then if you want an account, that it's account request, I believe. Yeah.
+00:57:38 What do they do? planetarycomputer.microsoft.com, that'll get you anywhere you need to go. And then if you want an account, it's /account/request, I believe. Yeah.

00:57:47 There's a big "request access" button you can click. That's awesome.

@@ -590,20 +599,20 @@

00:57:58 Bye.

-00:57:59 This has been another episode of Talk Python to me. Our guest in this episode were Rob Emmanuel and Tom Ox. Burger has been brought to you by Shortcut, formerly Clubhouse IO us over at Dock Python training and the transcripts are brought to you by assembly AI.
+00:57:59 This has been another episode of Talk Python to Me. Our guests in this episode were Rob Emanuel and Tom Augspurger. It has been brought to you by Shortcut, formerly Clubhouse.IO, us over at Talk Python Training, and the transcripts are brought to you by 'Assembly AI'.

-00:58:15 Choose Shortcut, formerly Clubhouse IO for tracking all of your projects work because you shouldn't have to project manage your project management. Visit Pack Python FM Shortcut Do you need a great automatic speechtotext API?
+00:58:15 Choose Shortcut, formerly Clubhouse.IO, for tracking all of your project's work, because you shouldn't have to project manage your project management. Visit 'talkpython.fm/shortcut'. Do you need a great automatic speech-to-text API?

00:58:29 Get human level accuracy in just a few lines of code?

-00:58:32 Visit Talk Python FM assembly AI when you level up your Python, we have one of the largest catalogs of Python video courses over at Talk Python. Our content ranges from true beginners to deeply advanced topics like memory and async and best of all, there's not a subscription in sight. Check it out for yourself at training to Python FM be sure to subscribe to the show. Open your favorite podcast app and search for Python. We should be right at the top.
+00:58:32 Visit 'talkpython.fm/assemblyai'. Want to level up your Python? We have one of the largest catalogs of Python video courses over at Talk Python. Our content ranges from true beginners to deeply advanced topics like memory and async, and best of all, there's not a subscription in sight. Check it out for yourself at 'training.talkpython.fm'. Be sure to subscribe to the show. Open your favorite podcast app and search for Python. We should be right at the top.

-00:58:57 You can also find the itunes feed at itunes, the Google Play feed at Play and the Direct RSS feed at RSS on Talk Python FM.
+00:58:57 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on 'talkpython.fm'.

-00:59:06 We're live streaming most of our recordings these days. If you want to be part of the show and have your comments featured on the air. Be sure to subscribe to our YouTube channel at Talk Python Film YouTube.
+00:59:06 We're live streaming most of our recordings these days. If you want to be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at 'talkpython.fm/youtube'.

00:59:18 This is your host, Michael Kennedy. Thanks so much for listening.

00:59:21 I really appreciate it.

-00:59:22 Now get out there and write some Python code.
\ No newline at end of file
+00:59:22 Now get out there and write some Python code.